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Summary 


A parallel algorithm has been developed for rapidly solving trajectory optimization problems. 
The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line 
optimal guidance through repeated solution of a trajectory optimization problem. The algorithm 
has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero- 
order-hold discretization of a continuous-time problem and solves the resulting nonlinear 
programming problem using a custom-designed augmented Lagrangian nonlinear programming 
algorithm. The algorithm achieves parallelism of function, derivative, and search direction 
calculations through the principle of domain decomposition applied along the time axis. It has been 
encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar 
minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent 
guidance problem. Execution times as fast as 1 1 8 sec of wall clock time have been achieved for a 
128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has 
been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 
2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32- 
nodes instead of 1-node to solve a 64-stage Goddard problem. 
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1. Introduction 


1.1 Review of Project Objective 

This grant's goal has been to achieve nonlinear optimal feedback control through repeated, 
on-line solution of trajectory optimization problems in real-time. Such control could be used for 
aerospace vehicle guidance, such as National Aerospace Plane (NASP) ascent guidance. The 
primary effort of this grant has been to develop a fast trajectory optimizer that can solve problems 
which include inequality constraints. The optimizer should be able to update solutions about once 
every 5 seconds. General parallel algorithms have been developed to try to meet this goal. They 
have been developed and tested on INTEL iPSC/2 and iPSC/860 distributed-memory parallel 
processors, but they are applicable to any highly-parallel distributed-memory machine (e.g. the 
INTEL Touchstone system or Transputer-based systems). 

1.2 Summary of 3 1/2-Year Grant Activities 

The activities under this grant can be broken into 5 main categories: 

1 . Development of a parallel solver for dynamic quadratic programs (QPs). 

2 . Development of a robust serial nonlinear programming algorithm (NP). 

3. Development of a specification and interface for trajectory optimization 
problem encoding, including approximation of continuous-time phases by 
discrete-time phases. 

4. Integration of the parallel QP algorithm with the serial NP algorithm to 
produce a parallel trajectory optimization algorithm. 

5. Modelling, encoding, and testing of example trajectory optimization 
problems. 

The first year of the project was devoted primarily to developing a parallel dynamic quadratic 
programming algorithm. The second year's effort was split between developing a refined parallel 
QP algorithm and developing a robust serial NP algorithm. During these first two years Kihong 
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Park, the project's graduate research assistant, concentrated on the parallel QP work. Prof. Mark 
Psiaki, the project's principal investigator, helped guide Park, and he developed the nonlinear 
programming algorithm. 

Park spent the final year and a half of the project working to parallelize the serial nonlinear 
programming algorithm. This included integration of it with the parallel QP algorithm, which 
calculates NP search directions, and with parallel function, gradient, and Jacobian software that 
was developed by Psiaki. 

Psiaki spent the last year and a half of the grant developing the specifications and software 
that allow the submission of different problems to the main algorithm. He also spent time 
modeling and encoding three test problems: a Goddard problem, a linear tangent steering 
minimum-time problem, and a NASP minimum-fuel ascent guidance problem. 

1.3 Outline of Report 

The remainder of this report is divided into 6 Sections plus conclusions and 5 Appendices. 
Section 2 summarizes the results of work that has been done to parallelize the computation of 
trajectory optimization search directions based on first and second gradient information. Appendix 
A contains two papers that give the details of this work. Section 3 summarizes the nonlinear 
programming algorithm that has been developed to serve as the heart of the trajectory optimization 
algorithm, and Appendix B contains a paper that describes the algorithm in more detail. Section 4 
gives an overview of how the nonlinear programming algorithm has been augmented with parallel 
derivative calculations and integrated with the parallel search direction algorithm to yield the final 
parallel nonlinear trajectory optimization algorithm. Appendix C contains a paper that fills in some 
of the algorithm details omitted in Section 4. 

Section 5 presents three example trajectory optimization problems, the Goddard sounding 
rocket problem, the minimum-time, acceleration-limited particle-in-a-plane problem, and a NASP 
ascent guidance problem. Section 6 presents parallel computational results for these three 
problems. Section 7 makes observations about the algorithm based upon computational 
experience, and it suggests possible improvements. Section 8 presents the conclusions. 
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Appendix D presents a specification document that explains how to encode of a trajectory 
optimization problem so that it can be linked to the parallel trajectory optimization algorithm. The 
code that models the planar minimum-time problem is included at the end of Appendix D as an 
example of how to conform to the problem encoding specification. 

Appendix E outlines the theory of a multi-dimensional spline technique that has been 
developed as a by-product of this research. Multi-dimensional splines have been used on the 
NASP problem, and the procedure on Appendix E presents a relatively simple technique to carry 
out rapid, on-line multi-dimensional spline calculations that have continuous second partial 
derivatives of the interpolated function. 

2. Dynamic Quadratic Programming/Search Direction Calculation on a Parallel 

Processor 

This part of the effort concentrated on developing an efficient parallel solver for a problem of 
the form: 


T 


find: 



(la) 

to minimize: 

N 

J ~ { 2^Sc ^ kX k 8 k^k 1 


(lb) 


k=0 



subject to: 

Ek Xk + ek + Fk+i Xk+i = 0 

for k = 0 ... N-l 

(lc) 


D k x k + d k = 0 

for k = 0 ... N 

(Id) 


The solution vector at a given stage, Xk> is usually a combination of a state vector, Xk, and a 
control vector, u^. The vector could also include slack variables if required by the NP algorithm 
that uses the QP solver. 

The constraints in eq. (lc) are dynamic constraints that link stages. Not that the form is 
general enough to admit the standard linear difference equation model of a linear time-varying 
discrete-time system. The nonhomo geneous term in eq. (lc), ek» is included to make the QP 
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algorithm compatible with NP trajectory optimization algorithms that do not satisfy the dynamics at 
intermediate stages of the solution process. 

The auxiliary constraints in eq. (Id) allow the modeling of state and control constraints that 
apply point-wise in time, such as bounds on the throttle setting or on the dynamic pressure. Even 
though the constraints in. eq. (Id) are equality constraints, they are useful for dealing with 
inequality constraints if an active set strategy is employed. An active set strategy assumes that 
some of the inequality constraints are active and others are inactive. The active inequalities are 
enforced as equalities, and the inactive constraints are ignored. Of course, logic must be employed 
to make changes to the active set assumptions as needed. In the present effort, this logic is part of 
the overall NP algorithm, not part of the QP algorithm. 

The QP algorithm that solves eqs. (la)-(ld) gets used by the nonlinear trajectory optimization 
algorithm to determine a search direction in state and control time-history space. It is used by the 
NP algorithm in a sequential quadratic programs sort of approach: the constraints are linearized 
about the current solution estimate and the Lagrangian function (cost plus multipliers times 
constraints) is approximated by a quadratic expansion about the current solution estimate. This 
yields a quadratic program, which is solved for a search direction. The solution is modified by 
moving along this search direction in state and control time history space. A merit function is used 
to determine the step length. This process ensures global convergence in many cases and quadratic 
convergence near the solution, where the quadratic program is a good approximation of the original 
nonlinear program. 

A nonlinear trajectory optimization algorithm can call for hundreds or thousands of QP 
solutions during one solution for an optimal trajectory. Therefore, it is extremely important that the 
QP be solved rapidly. In order to accomplish this, the domain decomposition approach has been 
used to parallelize the QP solution procedure. 

The basic idea of domain decomposition is to split the problem into smaller problems and 
partially solve the smaller problems. The unsolved parts of the problem are then joined into 
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progressively bigger portions that admit more partial solution to be accomplished until the entire 
problem is finally solved. 

The domains that are split to solve eqs. (la)-(ld) are time domains. The problem is split into 
a number of groups of stages. Each group is a set of contiguous stages. Any such set of stages 
can be partially optimized independently of any other such set if one assumes that the state vector 
on the boundary of any two such sets must be held fixed during the partial optimization process. 
After these partial optimizations, the remaining unsolved parts of contiguous sets of stages are 
aggregated into a smaller number of larger sets of contiguous stages. This frees some state vectors 
that had been on set boundaries, allowing them to be optimized in the next partial optimization 
cycle. 

The final parallel QP algorithm achieves good speed-up when run on a parallel processor. It 
can solve an N-stage dynamic QP problem on p (<N) processors in wall clock time that scales as 
(N/p) + Log2(p). Two versions of the algorithm and their computational timing results are 

described in detail in the two papers that have been included as Appendix A of this report. The 
second paper in the Appendix A describes the algorithm that actually has been used as part of the 
overall nonlinear programming algorithm. 

3. Nonlinear Programming Algorithm 

Once approximated in a discrete-time form, most trajectory optimization problems can be 
stated in a general nonlinear programming form: 


find: x (2a) 

to minimize: J(x) (2b) 

subject to: c e (x) = 0 (2c) 

Cj(x) < 0 (2d) 


where x is a large vector that includes the entire discrete-time state and control time histories, the 
J(x) cost function is a summation over stage-wise costs, c e (x) is a large vector of equality 
constraint functions that includes, among other things, the dynamic difference equations, and the 
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q(x) inequality constraints may include state and control auxiliary constraints at each stage. If N 
is the number of discrete-time time steps (or stages), then the dimensions of the vectors x, c c (x), 
and cj(x) include a factor of N. 

While an efficient trajectory optimization algorithm must take advantage of special problem 
structure that is not apparent in eqs. (2a)-(2d), a good algorithm must also employ generic 
nonlinear programming techniques that apply to problems of this general form. Therefore, an 
effort has been made to develop a good algorithm for solving such problems. 

The basic NP algorithm that is the heart of the a real-time trajectory optimizer must have 
several important characteristics. First, it must be robustly convergent. In other words, it must be 
very likely to converge to a feasible local minimum even from a poor first guess. Second, it must 
have fast local convergence: it must rapidly determine the solution when it is near the solution. 
Third, it must have good global convergence speed. It must progress rapidly from being "far 
away" from the solution to being "near" the solution. The above requirements constitute a "tall 
order" for any general NP algorithm. 

The algorithm developed under this grant is a sequential quadratic programs-type 
implementation of the augmented Lagrangian nonlinear programming algorithm. It uses a shifted 
penalty function that is capable of achieving exact satisfaction of equality and inequality constraints 
without requiring infinite penalty weights. Several features have been added to the algorithm in 
hopes of speeding convergence. One is the use of constraint curvature directly in the quadratic 
sub-problem. Another is the use of a special constrained form of the QP subproblem that allows 
the use of large penalty weights, which speed local convergence. The box trust region technique 
has been incorporated into the algorithm in order to guaranteed convergence to a local minimum of 
the augmented Lagrangian function, which translates into global convergence to a problem solution 
as long as the algorithm does not converge to an infeasible point near a non-zero local minimum of 
the norm of the constraint violations. 

The algorithm was first encoded and tested in MATLAB as a serial algorithm. In the paper 
that has been included as Appendix B of this report, the NP algorithm is described, and it is 
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compared with NPSOL on several test problems. It performs well on the test problems, and it has 
more robust convergence than NPSOL version 4. 

Another, slightly different version of this algorithm has been developed which is not 

described in Appendix B. The modified version has three principal differences from the algorithm 

described in appendix B. All of these have to do with how the algorithm’s "inner" loop solves a 

quadratically-constrained quadratic program. First, the alternate algorithm uses curved line 

searches on every search step. Second, it does an inexact one-variable minimization of the 

augmented Lagrangian function during a parabolic line search using a golden section search with a 

secant-method-type acceleration of terminal convergence. Third, it uses a different curvature 

correction than that described in eqs. (lla)-(llc) of the Appendix B. Instead of Finding the 
correction that minimizes j 8x^8x as in eq. (11a), it finds the correction that minimizes — 8x^Jt8x 

where H is computed as in eq. (9) of Appendix B using the multipliers from eq. (12) of Appendix 
B. This H gets modified by adding positive numbers to its diagonal if it would otherwise yield an 
indefinite projected Hessian during the solution of the modified version of Appendix B's eqs. 
(lla)-(llc). 

This modified algorithm also has been tried on the test problems described in Appendix B. It 
yielded significantly more rapid convergence on most of the problems by reducing the total number 
of line searches required to solve the several QPs that arise during a given NP run. 

4. Parallel Nonlinear Trajectory Optimization Algorithm 
4.1 Algorithm Overview 

The parallel nonlinear trajectory optimization algorithm is a special version of the nonlinear 
programming algorithm that is described in Section 3 and Appendix B. The special version 
exploits parallelism and the special dynamic programming problem structure in order to speed up 
the algorithm when solving trajectory optimization problems. 
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The algorithm has been encoded to solve continuous-time problems of the form 


find: 

u(t) and x(t) for to < t < tf 
tf 

(3a) 

to minimize: 

J = j L[x(t),i/(t),t] dt + V[x(tf)] 
to 

(3b) 

subject to: 

x(to) given 

(3c) 


x =/Ix(t),u(t),t] 

(3d) 


a e [*(t),a(t),t] = 0 

(3c) 


<*i[x(t),u(t),t] ^ 0 

(3f) 


o 

II 

4— * 

'W'' 

X 

‘-c 

0) 

Q 

(3g) 



(3h) 


which it first approximates as a discrete-time problem through a zero-order- hold approximation of 
the control time history. The algorithm also has the option of directly solving discrete-time 
problems of the form 


fmd: 

X = 

r T T T T T TT 

U 0 , X J, U J, X 2 ,..., u N .J, x N 


(4a) 

to minimize: 

N-l 

J = ^ L k (x k ,u k ) + V[x N ] 

k=0 


(4b) 

subject to: 

x 0 given 


(4c) 


Xk + 1 

= f k (x k ,u k ) 

for k = 0 ... N-l 

(4d) 


ae k (x k ,u k ) = 0 

for k = 0 ... N-l 

(4c) 


ai k (x k ,u k ) < 0 

for k = 0 ... N-l 

(4f) 


a eN (x N ) = 0 


(4g) 


a iN( x N) — ® 


(4h) 


The algorithm also can solve mixed discrete-time/continuous-time problems in which any given 
phase is either totally discrete-time or totally continuous time. Discrete-time phases are allowed to 
begin and end with different numbers of state vector elements. Through clever problem modeling 
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tricks, this feature allows for optimization of initial conditions. Other modeling tricks allow the 
code to solve free-end-time problems. 

The specialization of Section 3's basic NP algorithm takes two main forms. First, the NP 
algorithm’s linear algebra is performed using the parallel dynamic quadratic programming 
algorithm that is the subject of Section 2 and Appendix A. Second, the cost and constraint 
functions (including the dynamics model), the cost gradients, the constraint Jacobians, and the cost 
and constraint Hessians are evaluated in parallel. 

Parallel evaluation of problem functions and their derivatives is possible because the problem 
can be split in a stage- wise manner. The cost, the differential equations, and their derivatives at 
time step 2 and at time step 56 can be evaluated simultaneously because they do not affect each 
other at intermediate solution estimates. One feature of the basic NP algorithm that contributes to 
this separability is its permission of constraint violations at intermediate solutions. In particular, 
the dynamics need not be exactly satisfied until the final solution is reached, which is why stage-2 
functions and derivatives are independent of stage-56 functions and derivatives. 

Of course, functions and their derivatives at different stages are related at the final solution. 
This relationship is achieved by the NP algorithm during its search for the problem solution. The 
algorithm’s mechanism for achieving the necessary relationships is its search direction computation 
via solution of a dynamic QP. The dynamic QP solver transmits function and derivative 
information between stages. 

The algorithm has been fully encoded and tested using a 32-node INTEL iPSC/860 message- 
passing parallel processor. The results of this testing are described below in Section 6. The 
algorithm itself is described in more detail in the paper contained in Appendix C. 

4.2 Problem Encoding 

The algorithm has been encoded to solve problems that are modeled via two FORTRAN 77 
subroutines, each of which has a pre-specified argument list and functionality. One subroutine 
models the problem's cost function and dynamics equations. The other subroutine models any 
auxiliary constraints such as bounds on state or control variables. The algorithm needs the names 
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of these user-defined subroutines along with user-defined data that gets input to the main trajectory 
optimization algorithm via several array and scalar arguments. 

The necessary input data and subroutines are described in a specification document that has 
been included as the first part of Appendix D. This document makes it possible for people to use 
the parallel trajectory optimization algorithm even though they are not intimately familiar with its 
inner workings. Also included in Appendix D is the code that has been used to model the 
minimum- time to the origin problem, which is described below in Section 5.2. 

As proof of the usefulness of the problem modeling specification. Prof. Psiaki was able to do 
all of the problem modeling and problem encoding for the examples described below in Section 5, 
despite the fact that he never saw any of the parallel trajectory optimization code. He was able to 
encode the problems simply by adhering to the specification in Appendix D. 

5. Modeling of Three Example Trajectory Optimization Problems 
5.1 Goddard Problem (Maximizing the Final Altitude of a Sounding Rocket) 

The following problem maximizes the terminal altitude of a sounding rocket in flight over a 
spherical, non-rotating Earth [1]: 


find: 

tf and T(t) for 0 < t < tf 

(5a) 

to minimize: 

J = - h(t f ) 

(5b) 

subject to 

"h = v 

(5c) 


T - D(v,h) 1 
m h 2 

(5d) 


T 

m = - 03 

(5e) 


h(0) = 1, v(0) = 0, m(0) = 1 

(5f) 


0 < T(t) < 3.5 

(5g) 


m(tf) = 0.6 

(5h) 


where h is the distance of the rocket from the center of the Earth measured in Earth radii, v is the 
velocity in Earth radii per Herg (2 ji Hergs is the period of a circular orbit at the Earth's surface), m 
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is the rocket mass nondimensionalized by its initial mass, T is the rocket thrust nondimensionalized 
by the rocket's initial weight at the Earth's surface, D(v,h) is the velocity- and altitude-dependent 
aerodynamic drag, and 0.5 is the exhaust velocity of the burned rocket fuel. Note that the two 
inequalities in (5g) put minimum and maximum limits on the control, T. The drag function 
assumes a constant, zero-lift drag coefficient and an exponential atmospheric density 

D(v,h) = 6,200 e 50 °(l-h) v Ivl C D (6) 

where Cd = 0.05. 

In order to model this free-end-time problem in a fixed-end-time format, the problem has 
been split into two phases, an extra state variable has been added, and an artificial problem time has 
been introduced whose fixed terminal value is 1 . Call the artificial problem time x. Then actual 
time is t = tf t. 

Given an N+l-stage discrete-time approximation of the problem, stages 0 to N with stage N 
being the terminal stage, the problem has been re-modeled in the form 


T(t) for 0 < x < 1 and U2 for 0 < x < (1/N) 


to minimize: J= - h(l) 


subject to 


u 2 v 


for 0 < x < (1/N) 


x 4 v for (1/N) < x < 1 

^ f ° r °^ <(i/N> 

dT jx 4 [ T - D(v.h) j_] f 0 r (1 /N)<tSl 

"u 2 [- ttt] for 0 £ x < (1/N) 
dm J L ' - 1 

dt [ X4 [- 03 - ] for (1/N) - X - 1 
dx4 fN u 2 for 0 < x < (1/N) 

dx (0 for (1/N) < x < 1 

h(0) = 1, v(0) = 0, m(0) = 1 
0 < T(x) < 3.5 
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m(l) = 0.6 


(7i) 


One can understand the relationship between this formulation and the original formulation by 
noting that X 4 (x) = tf for (1/N) < x < 1. Due to the zero-order-hold approximation of the control 
time history and the N+l -stage approximation, the algorithm assumes that U 2 (x) is constant over 
the interval 0 < x < (1/N). Therefore, U 2 (x) also equals tf on this interval. Thus, all of the 
differential equations are re-scaled by tf, which correctly accounts for the modified independent 
variable, x. 

Once in this problem form, it is straight-forward to encode the problem in accordance with 
the specification in Appendix D. This has been done, and computational results are presented in 
Section 6.1 below. In all of the runs reported in Section 6.1, the first guess of the optimal 
trajectory is approximately 

tf = 0.2 = X 4 = U 2 ( 8 a) 


T(t) = 


2 for 0 < t < 0. 1 
0 for 0.1 < t < 0.2 


( 8 b) 


1 + 0.5t 2 for 0 < t < 0. 1 


h(t) = j 



0.5t 2 for 0. 

l 0.99 + 0.2t - 

| 

' t 

for 

0 < t < 0.1 

v(t) = \ 

. 0.2 - t 

for 

0.1 < t < 0.2 


f 1 - 4t 

for 

0 < t < 0.1 

m(t) = 

0.6 

for 

0.1 < t < 0.2 


( 8 c) 

( 8 d) 

( 8 e) 


Which corresponds to a flat Earth, zero drag assumption and a control policy that keeps T at 2 until 
all of the fuel is used up. Note that this first guess does not satisfy the dynamic constraints 
because it neglects the drag term and models 1 /h 2 as a constant. 
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5.2 Acceleration-Limited Minimum Time to the Origin 

The following problem minimizes the time to bring a particle to rest at the origin in the plane 
subject to a magnitude limit on its acceleration: 


find: 

to minimize: 
subject to 


tf and u(t) for 0 < t < tf 
J= t f 



rO 0 1 On 


r° 

01 


0 0 0 1 


0 

0 

X = 

0 0 0 0 

X + 

1 

0 


-0 0 0 0- 


-0 

1 - 


(9a) 

(9b) 

(9c) 


x(0) (* 0) given 
u T (t)u(t) < 1.0 
x(tf) = 0 


(9d) 

(9e) 

(9f) 


where the first two elements of the state vector x are positions, its third and fourth elements are 
velocities, and the two components of u are accelerations. 

The solution to this problem is the well-known bilinear tangent steering control law [2]: 


u(t) = 


cos 0(t) 
[_sin 0(t). 


( 10 ) 


where 0(t) satisfies an equation of the form 

3l + 32t 

tan 0(t) = - pl — 


( 11 ) 


33 + 34t 

with unknown constants 3l. 32. 33. and 34- These constants are unique up to a common scale 
factor, and they must be numerically determined simultaneously with tf by solving a set of 
algebraic equations that enforce the terminal constraint, eq. (9f). 

An easy way to generate known solutions to this problem is to pick 3l» 32. 33. 34. an d tf 
and integrate backwards in time from x(tf) = 0 for tf seconds to generate x(0). This has been 
done to generate an example problem for testing the trajectory optimization algorithm described in 
this report. The resulting initial condition is 
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x(0) = 


( 12 ) 


"-73.605383" 

-35.125486 
2.949018 
- 0.430989- 

Several modeling tricks have been used to get this problem into a form suitable for encoding 
in accordance with the specification in Appendix D. It is desirable to have the acceleration time 
history be piecewise linear rather than piecewise constant, which is what the zero-order-hold 
assumption would yield. Therefore, it is necessary to augment the optimization state vector to 
include the two accelerations and to create a new optimization control vector that consists of the 
two rates of change of the two accelerations rather than the two accelerations themselves. 

As in the Goddard problem of Section 5.1, the state vector is further augmented to allow for 
the solution of this free-end-time problem with the fixed-end-time algorithm. The new state stores 
the end time tf. 

A further modeling trick is needed to allow optimization of the initial acceleration states in 
addition to optimization of the rates of change of the accelerations. This is accomplished by 
modeling the first stage as a discrete-time stage. It does not take up any actual time, but it 
initializes the state vector for the remaining continuous-time steps. Thus, for an N+l stage 
problem, the first stage, stage 0, is modeled by the following difference equation and constraints 
(for k = 0): 


Xk+1 


-73.605383- 

-35.125486 

2.949018 

0.430989 


uki 2 + uk 2 2 ^ 1 
0.01 < uk 3 


(13a) 

(13b) 

(13c) 


where Xk (not shown) is a zero-dimensional vector, Xk+i is a 7-dimensional vector, and Uk is a 


3-dimensional vector. Elements 1 and 2 of Xk+i are the initial positions, elements 3 and 4 are the 
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initial velocities, and elements 5 and 6 are the initial accelerations. Notice how elements 5 and 6 of 
Xk+i are set equal to elements 1 and 2 of Uk, which allows optimization of the initial accelerations. 
The third element of uk is the problem's final time, uk 3 = tf, which becomes the 7th element 


of Xk+i at the end of this stage. As with the Goddard problem, the algorithm works in artificial 
time x that has a fixed terminal value of x = 1. Thus, the relationship between this artificial time 
and the real problem time is t = tf x. The extra constraint in eq. (13c) is added for the practical 
purpose of preventing the algorithm from thinking that it can run time in reverse. 

Stages 1 through N-l of the problem are modeled by the following continuous-time cost 
integral, differential equation, and constraints 
l 


J = J X7dx 


d 


dx j 
— = x 7 < 

dx 


tT" 0010000 
0001000 
0 0 0 0 1 0 0 

0 0 0 0 0 1 0 

0 0 0 0 0 0 0 

0 0 0 0 0 0 0 

^ L 0 0 0 0 0 0 0 J 


+ 


X5 2 (x) + X6 2 (x) < 1 


0 

°H 

'N 

0 

0 


0 

0 


0 

0 

U 

1 

0 


0 

1 


- 0 

0- 

j 


(14a) 


(14b) 


(14c) 


0.01 < X7 


(14d) 


where the constraint in eq. (14d) is added for practical purposes. Even though eq. (14d) is 
redundant with eq. (13c) because dx 7 /dx = 0, this constraint is useful to help ensure reasonable 
intermediate solution estimates during the optimization process when the constraint dx 7 /dx = 0 


might be violated. 

The terminal stage, stage N, is modeled simply by the terminal constraints on the positions 
and velocities 

xi(x=l) = 0 fori = l, .... 4 (15) 

coupled with one more constraint on the acceleration magnitude 
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X5 2 (t=1) + X6 2 (t=1) ^ 1 (16) 

The code that models this problem has been included at the end of Appendix D as an example 


of how to apply the problem encoding specification at the beginning of Appendix D. The 
subroutine called FLMNTM encodes the dynamics and cost functions in eqs. (13a), (14a) and 
(14b) along with their first and second partial derivatives. The subroutine CMNTM encodes the 
auxiliary constraint functions in eqs. (13b), (13c), (14c), (14d), (15) and (16) along with their first 
and second partial derivatives. 

The first guess used for this problem takes the form 


t f = 2.980345 (17a) 

X1 (t) = -73.605383 + 2.949018t - 0.494744t 2 (17b) 

x 2 (t) = -35.125486 + 0.430989t - 0.072305t 2 (17c) 

x 3 (t) = 2.949018 -0.989489t (17d) 

X4(t) = 0.430989 -0.144610t (17e) 

x 5 (t) = - 0.989489 (17f) 

x 6 (t) = - 0.144610 (17g) 

x 7 (t) = 2.980345 (17h) 

ui(t) = 0.0 (17i) 

u 2 (t) = 0.0 (17j) 


where the notation used corresponds to the definitions associated with eqs. (14a)-(14d). Basically, 
this guess puts on the "brakes" to bring the system to rest, x 3 (tf) = X 4 (tf) = 0, but it fails to bring 
the system to the origin, xi(tf) ^ 0 and x 2 (tf) ^ 0. It satisfies the dynamics constraints and the 
inequality constraint on the acceleration magnitude. 

The subroutine MNTMIN in Appendix D sets the initial guess of the optimal trajectory. This 
subroutine also defines all of the other user-definable optimization algorithm inputs that are called 
for in the specification at the beginning of Appendix D except for N, the number of the terminal 
problem stage. The quantity N must already be known by the routine that calls MNTMIN. A main 


18 



program that calls the parallel optimization code can first make a call to MNTMIN in order to set up 
the user-defined quantities. 

MNTMIN's output arrays NUVEC, NXVEC, NECVEC, and NICVEC and its output 
scalars NUMAX, NXMAX, and NCMAX define various problem vector dimensions. The arrays 
UO and XO contain the first guess. The arrays IDT and T further define the problem model by 
indicating whether a stage is continuous-time or discrete-time and by defining the time associated 
with each problem stage. The arrays KRK and ISECGR provide signals to the optimization 
algorithm of which numerical integration routine to use for continuous-time stages and of whether 
or not analytic second derivatives are available for problem functions. The arrays DELTU and 
DELTX provide finite difference intervals for approximating any second derivatives that are not 
provided in analytic form by the user. 

5.3 NASP Minimum-Fuel Ascent Guidance Problem 

A minimum-fuel NASP ascent guidance problem can be posed to take the vehicle from just 
after lift-off (10 feet off of the ground at Mach 0.4 with a flight-path angle of 2° and a gross weight 
300,000 lb) to a circular orbit at 100 miles altitude using the minimum amount of fuel: 


find: tf and a(t), <t> a (t), and 4>r(t) for 0 < t < tf 

to minimize: J = - m(tf) 


subject to h = V siny 
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The problem model has four state variables; h is the vehicle altitude above the Earth's surface 
in ft, V is the vehicle's inertial speed in ft/sec, y is the flight-path angle in rad., and m is the mass 
in slugs. Equations (18c)-(18e) model motion in a vertical plane over a spherical, non-rotating 
Earth. They include the aerodynamic forces, lift L and drag D, the net axial thrust force due to the 
air-breathing and rocket propulsion systems, T, and the usual l/(h+Rn) 2 central gravitational force 
term. Equation (18f) models the vehicle mass decrease due to fuel consumption by the air- 
breathing and rocket propulsion systems. The control variables in the problem are a, the angle-of- 
attack in rad., <t> a , the non-dimensional fuel equivalence ratio of the air-breathing propulsion 
system, and <J) r (t), the rocket throttle setting expressed as a fraction of the maximum available 
thrust. 

Other quantities that appear in the problem model are Re, the radius of the Earth in ft, q, the 
dynamic pressure in lb/ft 2 , M, the Mach number, Cx a (<]>a>M)* the thrust coefficient of the air- 

breathing propulsion system in ft 2 , I a (<)) a ,M), the fuel specific impulse of the air-breathing 
propulsion system in sec, p(h), the atmospheric density in slugs/ft^, 8E(a,M), the trim elevon 
angle in rad, and <t>a m i n (M) and <|) arnax (M), the Mach-number dependent limits on the air-breathing 

propulsion system's fuel equivalence ratio. Some of the numerical constants that appear in the 
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problem model are the acceleration of gravity at the Earth’s surface, 32.174 ft/sec 2 , the rocket's 
maximum in vacuo thrust, 15000 lb, and the rocket's in vacuo fuel specific impulse, 444.0 sec. 

The problem's auxiliary constraints enforce various practical limits. Constraint (18h) keeps 
the dynamic pressure below 2000 psf in order to ensure structural integrity. Constraint (18i) keeps 
the maximum local heating rate below 400 BTU/sec/ft 2 . The angle-of-attack is constrained to lie 
between -1° and +12° by constraints (18j), and the trim elevon setting is constrained to lie between 
± 20° by eq. (18k). Constraints (181) enforce practical limits on the air-breathing propulsion 
system's fuel equivalence ratio. Below Mach 2, <j> a is constrained to lie between 0.0 and 2.0. 
Above Mach 2, tpa is constrained to lie between 0.05 and 5.0. Similarly, rocket thrust is limited by 
constraints (18m). Constraints (18n) limit propulsive plus aerodynamic acceleration in the velocity 
direction to be between ± 4 g's, and constraint (18o) limits the normal propulsive plus 
aerodynamic acceleration to be below 4 g's. Constraint (18p) keeps the vehicle from flying into 
the ground shortly after take-off. Constraints (18q)-(18s) enforce the final achievement of a 
circular orbit at the prescribed altitude. 

The air-breathing propulsion system model is an updated version of the model contained in 
Ref. 3. It has been supplied by the first author of Ref. 3. The model gives tabulated data for the 
thrust coefficient, Cx a (4>a.M), and the fuel specific impulse, I a (<J> a ,M), on a two-dimensional grid in 

(<(> a ,M) space. The model is discontinuous at M = 2 because of a propulsion system mode switch, 
presumably from a turbojet mode for M < 2 to a ramjet/scramjet mode for M > 2. The tabulated 
data only goes up to <j) a = 2 for M < 2, but it goes up to <f> a = 5 for M > 2, which is the reason for 
the Mach-number dependent limits placed on <}> a by constraints (181). 

Two-dimensional cubic splines of the tabulated data have been used to generate the functions 
CT a (<l>a»M) and I a (4) a ,M) for use by the trajectory optimization code. These splines are continuous 

with continuous first and second partial derivatives (except at M = 2). Because the underlying 
functions are discontinuous at M = 2, two separate cubic splines have been generated for each 
function, one that applies below Mach 2, and one that applies above Mach 2. These models are 
shown in 2-D form on Figs. 1 and 2. 
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The discontinuities in these functions could cause problems for the optimization software 
described in this report. An augmentation of the problem model has been developed to avert such 
difficulties. It enforces auxiliary constraints on Mach number. At one pre-selected time step, call it 
step number N maC h- 2 , the equality constraint M = 2 is enforced. At all time steps before 
Nmach-2 the inequality constraint M < 2 is enforced, and at all time steps after N mac h -2 the inequality 
constraint 2 ^ M is enforced. The low-Mach-number cubic splines of the functions Cx a (<t>a-M) and 

are always used at time steps before N mac h_ 2 , and the high-Mach-number cubic splines of 
the functions Cx a (<t>a>M) and I a (<J> a ,M) are always used at time steps greater than or equal to 

Nmach-2- 1° other words, the propulsion system mode switch is enforced at a particular time step. 
A clever modeling trick allows the actual time of this transition to remain free for the optimization 
process to determine. The trick uses an artificial time x to encode the problem. The Mach 2 
transition occurs at a fixed artificial time, but an augmented control quantity allows the actual time 
of this mode switch to be free. 

The rocket model used here is borrowed from Ref. 4. It assumes an in vacuo maximum 
thrust of 15000 lb, an in vacuo fuel specific impulse of 444 sec., and a nozzle area of 1 ft 2 . The 
thrust inside the atmosphere is 

T r = <t>r[ 15000 - 1.0 p(h)] (19) 

where p(h) is the atmospheric pressure in psf, which is implemented as a cubic spline function of 
altitude based on tabulated data from the 1976 U.S. Standard Atmosphere [5]. 

Combining the rocket thrust and the air-breathing system's thrust, the total vehicle thrust is 
T = qCx a (<t>a,M) + 0 r [ 15000 - 1.0 p(h)] (20) 

The total rate of fuel use in eq. (18f) consists of two terms: the first term is the fuel used by the 
air-breathing system, and the second term is the fuel used by rocket. This thrust model assumes 
that the two propulsion systems can operate simultaneously, and it lets the optimization algorithm 
determine whether it would be beneficial to do so. 

Before defining the aerodynamic model, it is necessary to define several aerodynamic 
quantities. The dynamic pressure takes the usual form 
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( 21 ) 


q = ip(h) V 2 

The p(h) density function is modeled as a cubic spline of tabulated data from Ref. 5. The Mach 
number is given by 

M = V/a(h) (22) 

where a(h) is the speed of sound in ft/sec. It is modeled as a function of the air pressure and the 
density 

a(h) = /MEnr (23) 

V p(h) 

which agrees well with the tabulated a(h) data in Ref. 5 for values of h up to 280,000 ft. Above 
this altitude the concept of a speed of sound starts to break down, but eq. (23) is still used to define 
a(h) for determination of M. This should not cause great errors because the aerodynamic and 
propulsive forces that depend on M are very small above h = 280,000 ft. 

The aerodynamic model gives the lift and drag forces, L and D, and the trim elevon setting, 
5E(a,M). Lift and drag are determined from the nondimensional trimmed lift and drag 
coefficients, C^a.M) and CD tr (oc,M): 

L = qSC Llr (a,M) (24a) 

D = qSC Dtr (a,M) (24b) 

where S is the wing reference area, 3603 ft 2 . Reference 3 provides untrimmed lift, drag, and 
pitching moment coefficients along with increments to these coefficients that depend on a, M and 
8E. Given a c.g. location expressed as a fraction of the mean aerodynamic chord, x/c, one can 
solve the pitch trim equation for the equilibrium elevon setting: 

0 = C Ma (a,M) + 2C M5E (a,M,5E) + [x/c][C La (a,M) + 2C LsE (a,M,5E)] (25) 

where CM a (a,M) and CL a (a,M) are the untrimmed pitching moment and lift coefficients, and 
C M5E (a,M,8E) and Cl 5 e (oc,M,SE) are the increments to these coefficients due to just one elevon 

(left or right). 

The trim elevon setting can be calculated as a function purely of a and M, SE = 8E(a,M), 
because the c.g. location has been modeled as being a pre-determined function of M. The function 
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x(M)/c has been chosen to make the vehicle nearly neutrally stable over much of its flight envelope 
when trimmed at lift = weight. This is true at high Mach numbers, but the aerodynamic data from 
Ref. 3 has a discontinuity at M = 1 due to a canard retraction as M increases through 1. In order to 
preserve continuity of the x(M)/c function, the vehicle is allowed to be statically unstable below 
Mach 1. The pre-programmed c.g. function in tabulated form is given in Table 1. 

Given the trim elevon setting from eq. (25), the trimmed lift and drag coefficients become 
C Ltr (a,M) = C La (a,M) + 2C L5E [a,M,5E(a,M)] (26a) 

C Dtr «x,M) = C Da (a,M) + 2C D5E [a,M,6E(a,M)] (26b) 

where Cd„ is the untrimmed drag coefficient, and Cd 5 E is the drag increment due to one elevon 

being deflected. 

The functions Cl^c^M), Co^OtiM), and 8E(a,M) are implemented as 2-dimensional cubic 
splines that interpolate between grid points in (a,M) space. As mentioned above, each of these 
functions has a discontinuity at M = 1 due to the assumption of a canard retraction as M increases 
through this value. Therefore, each of these functions has two separate cubic splines, one that 

applies for M < 1 and one that applies for M > 1. This leaves a discontinuity, which can be seen 
clearly on the plots of Cl,j.(cx,M) and CD tr (oc,M) that appear in Figs. 3 and 4. 

As with the propulsion model's discontinuities, the aerodynamic discontinuities could cause 
problems for the optimization software. Further auxiliary constraints on Mach number have been 
added to the problem statement in order to avoid such problems. At a pre-selected time step 
numbered N mac h-i, the equality constraint M = 1 is enforced. At all time steps before N mac h-i the 
inequality constraint M < 1 is enforced, and at all time steps after N mac h-i the inequality constraint 
1 ^ M is enforced. The low-Mach-number cubic splines of the functions Cl u .(<x,M), CD tr (oc,M), 

and 8E(a,M) are always used at time steps before N mac h-i, and the high-Mach-number cubic 
splines of these functions are always used at time steps greater than or equal to N mac h-i- As with 
the propulsion mode switch, a modeling trick allows the actual problem time of the switch to be left 
free despite that fact that it occurs at a fixed stage number. Of course, the stage number N mac h-i is 
chosen to be less than N mac h- 2 - 
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Several modeling tricks have been used to put the problem in a form compatible with the 
specification given in Appendix D. The state vector gets augmented to store dt/dx, the rate of 
change of actual time with problem time, and some auxiliary control variables, or constraints, or 
both are required at stages 0, N mac h-i, and N mac h-2 in order to allow adjustment of dt/dx so that it is 
independently definable for the three main problem phases, M<1,1<M<2, and 2 < M. 

In an attempt to improve the numerical conditioning of the problem, the inequality 
constraints have been re-scaled so that the maximum or minimum limit, if nonzero, becomes ± 1. 
Similarly, the optimization algorithm works with the following units in order to deal with x and u 
elements nearly on the order of 1 to 10. Time is in 1,000 sec. units, altitude is in 10,000 ft. units, 
speed is in units of 1,000 ft/sec., mass is in 1,000 slug units, flight-path angle is in rad., and 
angle-of-attack is in deg. 

The first guess used is for this optimization roughly approximates two features found in 
optimal trajectories that appear in Ref. 4: an arc along the q = 2000 psf limit followed by an arc 
along the heating-rate limit. The guess starts at M = .4 and h = 10 ft. It assumes a linear increase 
of h and M until M = 0.85 and h equals the altitude at with M = 2 would yield q = 2000 psf. 
Next, the guess accelerates in level flight to increase M from 0.85 to 2, which brings the trajectory 
onto the q = 2000 psf contour. Now M and h increase along this contour until the heating-rate 
limit in eq. (18i) is reached. The guess continues increasing M and h along this contour until V = 
23,000 fps. It then does a pull-up to put the vehicle into a transfer orbit with apogee = 528,000 
ft., the target orbital altitude. Finally, the guess does a long bum near apogee to approximately 
circularize the final orbit. 

The air-breathing propulsion system is run at <t> a = 1 up until the pull-up into a transfer orbit, 
and the rocket throttle setting is kept at zero until after the pull-up. During the circularizing bum, 
<|) r = 1 is used, and the time of the bum is based on the velocity increment needed at apogee to 
circularize the orbit. 

Initially, this guess procedure yields a plot of h vs. V for the atmospheric phase. The time 
parameterization of the h-V profile and the guessed time histories y(t) and m(t) are determined by 
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satisfying Euler approximations of 3 of the state differential equations, eqs. (18c), (18d), and 
(180- The a(t) guess is determined to minimize the error in the Euler approximation of eq. (18e) 
subject to the minimum and maximum bounds on a in eq. (18j). 

Of course, this first guess violates many of the problem constraints. It violates all of the 
differential equations slightly, and it may violate any of the auxiliary constraints that have not been 
specifically used in this first guess procedure (e.g., the bounds on 8E(a,M)). 

6. Trajectory Optimization Computational Results 

All of the results reported in this section are for work done on a 32-node INTEL iPS C/860. 
Encoding has been done in parallel FORTRAN. This is a message-passing parallel processor. 
Each node of the iPSC/860 is a "super" scalar processor. Simple tests indicate that each node's 
floating-point processor speed is about 1.5 Mflops, which is comparable to an HP Apollo 9000 
model 720 work station. The interconnections between nodes are on a rectangular grid, but the 
effect of the "distance" between nodes in this grid is supposed to be minimal due to advanced 
message routing hardware. 

6.1 Results for the Goddard Problem 

The Goddard problem has been solved on this machine using a variety of numbers of 
discrete-time problem stages to approximate the problem and using a variety of numbers of 
processors. Problems with 16, 32, 64, and 128 stages have been solved. Processor numbers 
ranging from 1 to 32 have been used when appropriate. The results reported here use the alternate 
NP algorithm outlined at the end of Section 3, the one that always uses curved line searches 
(except during its feasibility phase). 

Solution time histories for a 128-stage problem are plotted on Figs. 5 and 6. They compare 
well with the exact solution given in Ref. 1. Figure 3 of Ref. 1 indicates that the fixed-end-time 
solution with tf = 0.198 is also the optimal free-end-time solution. The tf = 0.198 altitude plot on 
Fig. 5 of Ref. 1 is very similar to the altitude plot in the upper left-hand comer of Fig. 5 of this 
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report, and the tf = 0.198 thrust solution on Fig. 6 of Ref. 1 is very similar to the thrust plot on 
Fig. 6 of this report; the slight differences are due, most likely, to the problem discretization. 

An important point about the present algorithm is how well it captures the singular arc that 
appears on the thrust plot between t = 0.024 and t = 0.07. This is accomplished without having to 
tell the algorithm in advance that it must differentiate the optimality condition in order to get the 
right equation for determining the singular arc. In essence, this differentiation is done 
automatically by the QP solution procedure described in Section 2 and in Appendix A. 

Computational timing results for various sized problems running on various numbers of 
processors are presented on Fig. 7. This figure shows how the addition of more processors can 
speed up the solution procedure significantly. For a 64-stage problem, the algorithm terminates 
7.2 times faster when running on 32 nodes than when running on one node. The speediest 
solutions for a given number of problem stages are achieved when the number of processors is 
greater than or equal to one half the number of problem stages. Of course the number of 
processors must not exceed the number or problem stages or the ability to parallelize under the 
present scheme breaks down. 

The times on Fig. 7 are for solution of the problem from a cold start, a relatively poor first 
guess. If the algorithm were to be used in a real-time guidance loop, then it would start with a 
good first guess on each update. The guess would be the remainder of the optimal trajectory that 
had been computed for the previous guidance update. This would be identical to the current 
optimal trajectory if not for disturbances, sensor noise, and model uncertainty. It is hoped that 
these contributions to solution uncertainty would be relatively small. If such were case, then the 
real-time version of the guidance algorithm would need to perform only a very few of its major 
iterations: computation of functions and their first and second partial derivatives and solution of a 
quadratic program. Furthermore, the solution of each quadratic program would require only one 
or a very small number of minor iterations due to the goodness of the solution guess. Thus, the 
algorithm might quickly re-optimize trajectories to provide real-time optimal commands. 
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With this in mind. Fig. 8 presents the wall clock time per major iteration when the algorithm 
is near the solution. These results are for various numbers of processors, all of them solving the 
64-stage Goddard problem. Note how the time per solution is split about evenly between the 
derivative calculations and the QP solution procedure for the case of 1 processor, but the QP 
procedure takes about 5 times as long as the derivative calculations when using 32 processors. 
This happens because the derivative calculations are inherently more parallelizable than the QP 
solution algorithm. Also, note that the 32-processor result is 12 times faster than the 1 -processor 
result. 

If the algorithm could terminate in 10 such iterations during a real-time guidance update, then 
guidance updates could occur about once every 2.5 sec. Noting that time is expressed in Hergs, 
the actual time per zero-order-hold interval is 2.54 sec for the 64-stage problem. Therefore, this 
scheme could work in a real-time loop under the assumption of 10 terminal-type algorithm 
iterations per guidance update. Note that the ability to perform such updates in the required 
number of iterations has not, as yet, been studied. 

The dependence of the number of major and minor algorithm iterations on the number of 
processors is shown in Figs. 9 and 10, respectively. Major iterations are those that calculate 
derivatives and start solving a new QP; they are termed "middle-loop" iterations in the paper in 
Appendix B. Minor iterations are those that perform one set of matrix factorizations and a line 
search; they are termed "inner-loop" iterations in Appendix B. 

Figures 9 and 10 demonstrate that the solution time is moderately influenced by a change in 
the required number of algorithm iterations with a change in the number of processors. This 
change is thought to be due primarily to the way in which the algorithm deals with directions of 
negative curvature in the QP projected Hessian. It can calculate different directions of negative 
curvature for the same indefinite QP when using different numbers of processors. 

6.2 Results for the Minimum-Time to the Origin Problem 

Both 32-stage and 64-stage approximations of the acceleration-limited minimum-time to 
origin problem have been solved. Figure 1 1 plots several different time histories associated with 
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the 64-stage solution. On the upper left-hand plot of the figure al and a2 are the accelerations. 
They correspond to X 5 and X 6 , respectively, in the model presented in eqs. (14a)-(14d) of Section 
5.2. Similarly, vl and v2 on the lower-left plot of Fig. 1 1 correspond to X 3 and X 4 , respectively, 
in eqs. (14a)-(14d). These plots agree very well with the exact solution computed from the linear 
tangent steering law, eqs. ( 10 ) and ( 11 ). 

One noteworthy aspect of the solution is that an auxiliary state-variable inequality constraint, 
eq. (14c) is active at all times. The sum of the squares of the curves for al and a2 is equal to 1 at 
each time point on the upper left-hand plot of Fig. 11. The algorithm is able to compute the 
optimal solution without the typical requirement that eq. (14c) be differentiated in order to 
introduce a control variable. In fact, just the opposite is the case in this problem. 

The state variable constraint in eq. (14c) was originally a control variable constraint, eq. (9e), 
in the original problem form, eqs. (9a)-(9f). The model has been modified by making the original 
controls into states and creating new controls that are the time derivatives of the original controls. 
This has been done in order to smooth out the zero-order-hold effects and achieve a better 
approximate solution for a given number of discrete-time steps. This modeling trick would be 
counter-productive were it not for the algorithm's ability to automatically handle pure state 
constraints. 

The 32-stage minimum-time problem has been solved on 1,2, 4, 8 , 16, and 32 processors. 
Figure 12 plots the solution times for this problem versus the number of processors used to solve 
the problem. As with the Goddard problem, an increase in the number of processors decreases the 
wall-clock time required for solution. The 5-fold decrease in the solution time as the number of 
processors increases from 1 to 32 is about the same as for the 32-stage Goddard problem. 

The variation of the number of major algorithm iterations with the number of processors for 
the 32-stage minimum-time problem is plotted on Fig. 13. Similarly, Fig. 14 plots the number of 
minor algorithm iterations vs. the number of problem stages. Both of these curves increase in 
comparison to the corresponding plots for the 64-stage Goddard problem, review Figs. 9 and 10. 
It is not clear why the iteration counts are higher for this problem. The cause may have something 
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to do with the increased number of states in the final form of the problem model (7 in this problem 
vs. 4 in the Goddard problem), or it may have to do with the presence of an active state inequality 
constraint in this problem. 

The variations of the iteration counts with the number of processors are more pronounced 
than for the Goddard problem. The 1 -processor case requires significantly fewer of each type of 
iteration. Again, these discrepancies are attributable to the presence of an indefinite projected 
Hessian during many (the majority!) of the QP/line-search minor iterations. 

The time per major iteration as it varies with the number of processors is plotted on Fig. 15. 
These times are for major iterations near successful algorithm termination, when only one QP 
search direction calculation is required per major iteration. Major iterations far from the solution 
generally require multiple QP search direction calculations and, therefore, take significantly longer. 
Plotted along with the total time per iteration are its principal components, similar to Fig. 8 for the 
Goddard problem. 

The total iteration time is roughly the same as for the Goddard problem (review Fig. 8), but 
the proportion of time spent solving the QP is larger than for the Goddard problem. This makes 
sense because the QP algorithm's scalar operation count goes as [n 3 + 0(n 2 )][(N/p) + log2(p)] 
when solving a problem with n state vector elements and N stages on p processors. For the 
present problem n = 7 and N = 32, whereas n = 4 and N = 64 for the Goddard problem on Fig. 8. 
The plots show that the QP solution procedure for the minimum-time problem is slower for all 
cases considered, but not quite as slow in comparison to the corresponding Goddard problem 
procedure as is indicated strictly by the leading terms of the scalar operation count. 

The iteration speed for the 32-stage minimum-time problem operating on 16 or 32 processors 
is under 0.5 sec. This is a fast execution time and may indicate suitability for use in real-time 
guidance. Any serious consideration of the real-time guidance issue will have to determine a 
reliable estimate of the typical number of iterations required to re-converge the solution after one 
guidance update interval. 
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6.3 Results for the NASP Ascent Guidance Problem 

A 32-stage NASP ascent guidance problem has been solved using the parallel algorithm 
developed under this grant. The solved problem models the Mach 1 transition, and the 
concomitant change in the aerodynamic model due to canard retraction, as occurring at stage 
N m ach-i = 6. It models the Mach 2 transition and concomitant propulsion model switch as 
occurring at stage N mac h -2 = 9. The initial stage is stage 0, and the terminal stage is stage N = 31, 
which leaves 22 zero-order-hold intervals for flight above Mach 2. These hold intervals have been 
modeled as all being of equal real-time duration. 

The algorithm solved the problem in 2 hours of wall clock time using all 32 of the 
iPSC/860's processors. It executed 2035 major algorithm iterations to solve the problem, about 
100 of which were used to achieve initial approximate feasibility of the trajectory; the NP algorithm 
described in Appendix B has an initial phase in which it enforces feasibility. 

The alternate NP algorithm described at the end of Section 3 is the one that has been used to 
solve this problem. An attempt was made to solve the NASP problem with the first NP algorithm 
of Section 3, but it did not get near convergence after 2000 major algorithm iterations. 

Plots of the solution are presented on Figs. 16, 17, and 18. The mass time history, the 
lower-right plot on Fig. 16, shows that the final on-orbit mass is about 56% of the original post- 
lift-off weight, which seems reasonable. The altitude and flight-path angle time histories, the two 
left-hand plots on Fig. 16, show an initial steep climb to about 150,000 ft. After this the vehicle 
climbs more slowly until it reaches its orbital velocity at an altitude of just below 200,000 ft. 
Finally, it executes a pull-up into a transfer orbit to reach its target orbital altitude. 

The control time histories on Fig. 17 show that the air-breathing engine provides most of the 
acceleration. Its fuel equivalence ratio, shown on the upper right-hand plot, is non-zero 
throughout the ascent. It seems a little bit silly that <t> a remains nonzero after the exit of the 
atmosphere at about t = 3,000 sec, but this does not cause much loss of final mass because so little 
actual fuel flow is involved. The rocket throttle setting, the lower left-hand plot, remains at zero 
until just before the terminal time when it helps to circularize the final orbit 
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The angle of attack time history, the upper left-hand plot on Fig. 17, shows some jitteriness 
coupled with a slow downward trend as speed increases toward the orbital velocity. There is a 
pull-up just before t = 3,000 sec to initiate the transfer orbit, then a is held negative during the 
transfer orbit and part of the terminal bum, presumably to help circularize at the proper altitude. 

The jitteriness of the angle-of-attack plot points up a shortcoming of this solution. It seems 
that the zero-order-hold intervals are too long during the flight phase above Mach 2. Each interval 
is about 200 sec long. If V = 2,900 fps, then the approximate Phugoid period is 400 sec. The 
lightly-damped Phugoid mode will alias if it has a shorter period (a higher frequency), which it will 
have for V < 2,900 fps. This aliasing of a lightly-damped open-loop mode could easily account 
for the jitters apparent on all of the plots, especially those showing angle-of-attack and flight-path 
angle. 

The altitude vs. Mach number plot, the left-hand plot on Fig. 18, shows related problems'*'. 
The vehicle accelerates from Mach 2 to Mach 17 in just one zero-order-hold interval — compare this 
plot with the velocity time history in the upper right-hand comer of Fig. 16. During this hold 
interval the trajectory may violate the dynamic pressure constraint because the auxiliary constraints 
are enforced only at the boundaries of the hold intervals. (Note that the straight-line approximation 
of the trajectory shown on the figure is probably not the actual trajectory computed via numerical 
integration.) The two spikes that appear on the h vs. M plot also may be due to the long hold 
interval. 

There are two possible fixes for the aliasing problem. One is to model the controls as being 
piecewise linear or piecewise parabolic. This could be done by augmenting the state vector with 
the original controls and making new control variables out of the first or second time derivatives of 
the original controls. The smoothed-out control signals would be less likely to excite the lightly- 


+ On a different note, the h vs. M curve changes from solid to dashed above about 280,000 ft 
because Mach number is not very meaningful at such extreme altitudes. 
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damped phugoid mode even when aliasing occurred. Another, simpler fix would be to greatly 
increase the number of problem stages. Attempts have been made to solve a 64-stage problem, but 
the algorithm took too long to converge in the tests conducted to date. 

The average time per major iteration of the algorithm is 3.5 sec. of wall clock time for this 
problem. Iteration times near successful algorithm termination are probably even shorter than 3.5 
sec. This is in the ball park of the kind of iteration speed that might make real-time optimal 
guidance practical for the NASP ascent problem, but other difficulties with the algorithm need to be 
worked out before it can be used. In particular, a solution to the aliasing problem must be found. 

Also, algorithm modifications are needed in order to speed global convergence. The slow 
global convergence of the algorithm on the NASP problem implies that standard guidance updates, 
despite having relatively good first guesses, would probably require many iterations of this 
algorithm. This would probably require too much time to make on-line solution practical. 

7. Discussion of Algorithm Performance and Suggestions for Improvement 
Application of the parallel trajectory optimization algorithm to the 3 test problems has shown 
some strengths and some weaknesses of the approach. One of the two most significant strengths 
is the ability of parallelism to greatly speed up each major algorithm iteration, which consists of 
derivative calculations, a search direction calculation, and a search step. These operations are 
common to almost all optimization algorithms, and the results of this work could be applied to 
parallelize trajectory optimization algorithms other than the one presented here. 

Another significant strength of the algorithm is the rapidity with which it is able to determine 
a feasible solution to a problem. Such a solution obeys the dynamics and all of the auxiliary 
constraints, including any terminal constraints, but it may be far from the optimum of the given 
cost function. If the most important guidance objectives are stated in terms of terminal constraints, 
such as the constraint that the NASP achieve a circular orbit of a certain altitude, then the algorithm 
can rapidly design sub-optimal trajectories that meet the most important guidance objectives. 
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The main weakness of the algorithm is its slowness to achieve convergence to the optimal 
solution. It can require many major iterations, most of which require a number of minor iterations. 
Each minor iteration involves solution of an equality-constrained dynamic QP. While this process 
has been parallelized, it is inherently difficult to achieve good parallel efficiency for this process. 
The algorithm would do well to cut down on the number of equality-constrained QP solutions that 
it computes on the way to a solution. 

One possible cause of much of the slowness revolves around the algorithm's actions when it 
encounters an indefinite projected Hessian during computation of a search direction. Presently, it 
attempts to search in a direction of negative curvature every other time that it encounters an 
indefinite Hessian, but it cannot guarantee that it is searching in the direction of most negative 
curvature. 

The presence of quadratic constraint terms in the middle-loop QP problem (see the paper in 
Appendix B) may cause more harm than good. These terms, when coupled with the algorithm's 
logic for dealing with negative cost function curvature, can cause the algorithm to cycle between 
several search directions, none of which makes much progress in reducing the cost function. 

Another problem may be the augmented Lagrangian NP algorithm itself. The current 
implementation, as described in Appendix B, is designed to work well with large penalty weights 
in order to be able to assure rapid local convergence. Unfortunately, large penalty weights can 
cause global convergence problems that are not addressed by the current algorithm. 

Yet another problem with the algorithm may be its use of the box trust region method of 
assuring global convergence. This method tends to lean heavily on equality-constrained QP 
factorizations and to rely only very lightly on function evaluations. In the parallel environment, QP 
solutions are relatively expensive, and function evaluations are relatively cheap. A more efficient 
parallel algorithm might be one that ensures global convergence by using line searches to determine 
the search step length. 

The rapidity of the constraint satisfaction phase of the algorithm suggests an alternate 
approach to develop a rapid parallel trajectory optimization algorithm. Instead of using a standard 


34 




NP type of algorithm, one could develop a parallel equation solving algorithm for the necessary 
conditions. While this would pose some difficulties in that the Kuhn-Tucker necessary conditions 
involve inequalities and the complementarity condition, these difficulties could probably be 
surmounted. The algorithm would use the norm of the necessary condition violation as its merit 
function. It could use this merit function in a Levenberg-Marquardt-type of approach. Parallel 
solution times might be on the order of the times required by the current algorithm to find an initial 
feasible trajectory during its first phase. Such performance, if realized, would make the necessary 
condition solver 20 times faster than the present algorithm when solving the NASP problem. 

The equation solving approach has the advantage that many of the parallel ideas developed 
under this grant could be directly applied to it, but it has the disadvantage of possible convergence 
to a saddle point or to a local maximum; the necessary conditions only ensure that a stationary point 
has been found. In any event, one of the parallel QP factorization algorithms of Appendix A could 
be used to check whether or not the stationary point was a local minimum. The algorithm would 
form and Cholesky factorize the projected Hessian at the solution. If the Cholesky factorization 
terminated successfully without producing a negative diagonal element, then the solution could be 
certified to be a local minimum. 


8. Conclusions 

An effort has been made to develop a fast parallel trajectory optimization algorithm suitable 
for use in on-line, real-time guidance of an aerospace vehicle. The parallel algorithm approximates 
continuous-time phases of a problem via the zero-order-hold control discretization and Runge- 
Kutta numerical integration over the hold intervals. This gives rise to a large nonlinear program 
with the specialized dynamic programming structure. The parallel algorithm uses a specialized 
version of the augmented Lagrangian nonlinear programming algorithm to solve this problem. 

The algorithm starts with a guessed solution that need not satisfy the dynamics; state and 
control time histories are guessed. Next, the algorithm computes the cost, dynamics, and auxiliary 
constraint functions and their first and second partial derivatives. This is accomplished in parallel 
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by having different processors simultaneously compute these quantities for different time steps. 
The algorithm then uses this information to set up a quadratic approximation of the original 
trajectory optimization problem. It solves this quadratic approximation, subject to trust region 
bounds, to compute an improved guess of the solution. It uses a special parallel dynamic quadratic 
programming algorithm to solve this problem. Given an improved guess, this process repeats 
itself until suitable termination criteria are satisfied. 

Global convergence to a local minimum is assured by adaptively adjusting the trust region 
size to enforce descent of the augmented Lagrangian function. Satisfaction of the dynamic 
constraints and the auxiliary constraints is enforced via an outer loop in which multiplier guesses 
are updated to bias the penalty terms in the augmented Lagrangian function. 

The algorithm has been tested off-line on three problems, the Goddard problem of 
maximizin g a sounding rocket's peak altitude, the planar, acceleration-limited minimum-time to the 
origin problem, and a lift-off-to-orbit National Aerospace Plane (NASP) minimum-fuel ascent 
guidance problem. The algorithm is able to accurately solve problems with singular arcs or with 
active state-variable inequality constraints without placing any special burden on the problem 
modeling process. 

Good solution speeds have been achieved on the Goddard and minimum-time problems, 
even with fairly poor first guesses. A 128-stage Goddard problem has been solved in just 118 sec 
using all 32 nodes of an INTEL iPSC/860 parallel processor. This problem has 4 state variables 
and one control variable after special modeling tricks have been applied. A 32-stage minimum-time 
problem has been solved in 151 sec. The problem model that was submitted to the algorithm has 7 
state variables and 2 control variables. 

The use of fewer processors results in increased solution times. The solution of a 64-stage 
Goddard problem takes 7 times longer on 1 processor than on 32 processors. The optimum 
number of processors seems to be about half the number of problem stages for typical problems. 

Performance on the NASP problem has not been as good. A 32-stage problem with a 5- 
dimensional state vector required 2 hours (7200 sec) to reach a solution when using 32 processors. 
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The time per major algorithm iteration is low for all three problems, especially when the 
algorithm is near successful termination. Each major iteration calculates functions and derivatives, 
solves a QP, and updates the solution estimate. When using all 32 processors, the iteration time on 
the Goddard problem is 0.26 sec near algorithm termination, but this time increases to 3.08 sec 
when using just 1 processor. Similarly, a 32-stage minimum-time problem's iteration time near 
algorithm termination is just 0.38 sec on 32 processors, but it is 3.09 sec on 1 processor. A 32- 
stage NASP problem has an average major iteration time of 3.5 sec when run on 32 processors, 
and its iteration time near algorithm termination is even smaller, but it required 2035 such iterations 
to converge from the first guess tried in this study, which accounts for the long time required to 
solve that problem. 

The suitability of this algorithm for on-line, real-time guidance purposes depends on two 
things, its iteration speed and the number of iterations that it must execute to re-converge to the 
solution between guidance updates. The algorithm developed under this grant has a proven ability 
to iterate rapidly, but its ability to converge in few iterations seems to be problem-dependent. In 
summary, the algorithm is not suitable for all on-line guidance applications in its present form, but 
it may be suitable for some on-line applications. 
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Table 1 


NASP Center-of-Gravity Location 
as a Function of Mach Number 


M 

x/c 

0.3 

0.204000 

0.7 

0.217000 

0.9 

0.230000 

1.5 

0.243292 

2.5 

0.234107 

4.0 

0.137814 

6.0 

0.068927 

10.0 

0.040987 

15.0 

0.030701 

20.0 

0.031225 

24.2 

0.030864 
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vs. Mach number and fuel equivalence ratio. (Mach 
number scale changes at Mach 2.) 












sion svstem's specific impulse 
fuel eouivalence ratio. (Mach 
at Mach 2.) 
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Fig. 3. Trimmed lift coefficient vs. Mach number and angle 
of attack. (Mach number scale changes at Mach 1.) 
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Goddard problem. 
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Goddard problem. 




Nodes 



Solution time vs, the number of problem stages 
for the Goddard problem run on various numbers 




Time per major iteration and its components when 
near the solution of the 64-stage Goddard problem 
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Fig. 10. The number of minor QP iterations vs. the number of 
processors for the 64-stage Goddard nroblem. 
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Solution time histories and phase plot for the 
64-stage minimum-time to the origin oroblem. 
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Fig. 12. Solution time vs. the number of nrocessors for the 
32-stage minimum- time to the origin problem. 
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algorithm iterations vs. the 
the the 32-stage minimum- time 
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The number of minor QP iterations vs. the number of 
processors the the 32-stage minimum-time to the 
origin problem. 
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Time per major iteration and its components 
when near the solution of the 32 -stage minimum- 
time to the origin problem. 
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Fig. 16. State time histories for a 32-stage NASP ascent 
guidance problem. 
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Two papers on parallel dynamic 
quadratic programming. 
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Abstract. A key algorithmic element of a real-time trajectory optimization hardware/software 
implementation is presented, the search step solver. This is one piece of an algorithm whose 
overall goal is to make nonlinear trajectory optimization fast enough to provide real-time commands 
during guidance of a vehicle such as an aero-maneuvering orbiter or the National Aerospace Plane. 
Many methods of nonlinear programming require the solution of a quadratic program (QP) at each 
iteration to determine the search step. In the trajectory optimization case the QP has a special 
dynamic programming structure, an LQR-like structure. The algorithm exploits this special 
structure with a divide and conquer type of parallel implementation. A hypercube message-passing 
parallel machine, the INTEL iPSC/2, has been used. The algorithm solves a (p-N)-stage problem 
on N processors in 0(p+log2N) operations. The algorithm yields a factor of 8 speed-up over the 

fastest known serial algorithm when solving a 1024-step test problem on 32 processors. 


Key Words. Trajectory optimization, parallel processing, quadratic programming, dynamic 
programming, linear quadratic regulator. 
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1. Introduction 


The present work is part of an effort to do real-time optimal guidance by repeatedly solving 
trajectory optimization problems on line. The form ot the nonlinear problem that must get solved is 

T 


find: 

TTT t 7X1 


(la) 


N 


(lb) 

to minimize: 

J = Y L()c k ,u k ,k) 



k=0 



subject to: 

Xq given 


(lc) 


Xjc+1 = f(XkiUk.k) 

for k = 0 ... N-l 

(Id) 


0 
II VI 

2 

1 

for k = 0 ... N 

(le) 


where x k is the state vector at stage k, and u k is the control vector. The nonlinear difference 

equation [Eq. (Id)] defines the system s dynamics, and Eq. (le) defines any auxiliary constraints. 
Note that control vector u^ is associated with the terminal stage N, which is unusual, but possible. 

Also, the number of elements in the state vector and control vector may vary trom stage to stage. 
The word stage is used to refer to a single discrete time step, not to a system staging process such 
as booster separation on a launch vehicle. 

Trajectory optimization is a mature discipline, and algorithms exist for solving such problems 
(e.g.. Refs. 1-7), but none can be used for real-time guidance. This paper is part of a research 
effort aimed at developing a parallel trajectory optimization algorithm that can qualify tor real-time 
use via improved speed and convergence reliability. This paper partly addresses the issue of 
algorithm speed by developing a parallel implementation that reduces the time required to determine 
the search direction for a single iteration of a trajectory optimization algorithm. It is applicable to 
any trajectory optimization algorithm that generates search steps and that is formulated to work on a 
discrete-time system. 

Many search-based trajectory optimization algorithms compute the search direction by 
solving a linear-equality-constrained, quadratic-cost problem. Such algorithms use active set 
strategies, which guess that all of the equality constraints and a subset of the inequality constraints 
are satisfied as exact equalities. The linear equality constraints of the search-step subproblem are 
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just the linearizations of the constraints in the active set [Eq. (Id) and a subset of the rows of Eq. 
(le)]. In Newton's method, the Hamiltonian is approximated by retaining Taylor series terms out 
to the quadratics; this results in a problem with a quadratic cost. Even the steepest-descent method 
effectively uses a quadratic cost model: the Hamiltonian is expanded out to the linear terms, and its 
Hessian is approximated by the identity matrix in the control diagonal block and zeros elsewhere. 
Such problems are similar to time-varying LQR problems. The present paper concentrates on the 
development of an efficient means of solving such problems on a distributed-memory message- 
passing parallel processor with a hypercube connection topology. 

A number of other researchers have attempted to speed up the solution of problems like Eqs. 
(la)-(le) via use of parallel processors. Larson and Tse tried to speed up Bellman s state/control 
discretization approach to dynamic programming (Ref. 8). Menon and Lehman proposed a method 
based on integrating matrices and general parallel matrix solvers (Ref. 9). Travassos and Kaufman 
proposed methods based on the TPBVP approach and a general parallel optimization package 
applied to enforce satisfaction of the terminal boundary condition (Ref. 10). Chang et. al. 
decompose the problem into shorter problems of several stages each (Ref. 11), first optimizing 
each group separately, in parallel, then optimizing the interconnected problem in an outer 
optimization loop. Betts and Huffman have calculated cost gradients and constraint Jacobians in 
parallel (Ref. 12). Wright proposes calculating the gradients. Jacobians, and Hessians in parallel 
and develops an algorithm to solve for the Newton step with a parallel factorization of the 
linearized necessary conditions (Ref. 13). 

The present algorithm is a divide-and-conquer type algorithm and is closely related to 
Wright's method of factorizing the linear necessary conditions. Both methods are, essentially, 
special factorizations of the Kuhn-Tucker matrix. The present method has capabilities that 
Wright's method lacks: it can handle auxiliary state constraints, and it can detect and correct for an 
indefinite projected Hessian. The present factorization algorithm is interpreted as a series of 
parallel partial solutions of single-stage optimization problems followed by grouping of pairs of 
stages into single longer stages. Thus, parallelization is accomplished on a stage-wise basis. 
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Repeated application eventually reduces the problem to a single stage. This parallel reduction ol 
the number of problem stages is similar to the approach that Chang et. al. use for the nonlinear 
problem (Ref. 11), though they only go through one reduction cycle. Two versions of the basic 
algorithm are presented: the "pure " algorithm, which assumes that the number of stages equals the 
number of processors, and a "hybrid" algorithm for use when the number of stages exceeds the 
number of processors. 

The next section of the paper reviews general equality-constrained QP procedures and the 
usual backwards sweep method of solving such problems on a serial processor. Section 3 
presents the outline and details of the parallel QP algorithm. It also gives details of a benchmark 
serial algorithm that has been used to determine the speed-up due to parallelism. Section 4 
describes the test problem for the algorithm and gives timing results for the parallel and serial 
algorithms. Section 5 includes further discussion of the parallel algorithm. Section 6 concludes 
the paper. 


2. Generalized Time-Varying LQR Problem with Equality Constraints 

Algorithms for solving Eqs. (la)-(le) start with a guessed solution time history and 
generate a sequence of time histories that converges to the optimum. In 

Newton's method (with an active-set strategy for the inequalities), the search direction 52i is 
computed at each iteration by solving a problem that looks like a time-varying discrete-time LQR 

problem, with some extra elements: 


find: 


S& = 8uJ,5xj,5u{,5x2,...,5u^.j,5x^,5u^ 


N 


to minimize: J = 


I 


T 


5x k 


“ T T " 

5x k 

5x k ,5ii k 

T 


+ 

Sx k’ Su k 

_ 2 * * 5 Mk_ 


k=0 

subject to: 5x<) given 

Sxk + i = F k 5 *k + Gk Su k + z k 


for k = 0 ... N-l 


(2a) 

} (2b) 

(2c) 

(2d) 



A k Sxfc + B k 5u k + Cjj = 0 


for k = 0 ... N 


(2c) 


This is a minimization problem with a quadratic cost and linear equality constraints. The 

quadratic cost in Eq. (2b) comes from the second order approximation of the Hamiltonian about the 
current iterate. The H XXfc , H XUk , and H UUk matrices correspond to the quadratic terms and the g Xk 

and £„ k vectors correspond to the linear terms. Equation (2d) is a linearization of Eq. (Id) about 

the current iterate. The F k matrix is the state transition matrix, the G k matrix is the control 
effectiveness matrix, and the z k vector represents the amount by which Eq. (Id) is not satisfied at 
the current iterate. At each iteration, a set of inequality constraints in Eq. (le) are assumed active 
and are enforced as equalities. They are linearized about the current iterate point to yield some of 
the linear equality constraints in Eq. (2e). The remainder of the linearized equality constraints in 
Eq. (2e) arise from the linearizations of any equality constraints in Eq. (le). The A k and B k 
matrices make up the Jacobian of the active constraints, and the c k vector's elements represent the 
amounts by which active constraints in Eq. (le) are not satisfied at the current iterate. 

The differences between the problem of Eqs. (2a)-(2e) and a standard time-varying, discrete- 
time LQR problem are the following: the presence of auxiliary constraints [Eq. (2e)], the presence 
of nonhomogeneous terms in the state difference equation, z k , and the presence ot the linear cost 

terms. Also, the quadratic cost matrices need not satisfy the conditions necessary tor LQR 
stability. The search step calculation problems of most trajectory optimization algorithms can be 
transformed into this special form (e.g., those of Refs. 5 and 7). 

2.1. Variable Reduction Method for General Equality-Constrained Quadratic 
Programming. The variable reduction method uses the constraints to solve tor some of the 
problem variables in terms of the remaining variables (Ref. 14). If the stagewise quantities in Eqs. 
( 2 a)-( 2 e) are lumped together, the following problem results: 

find: 5* (3 a ) 

to minimize: J = ^- 82 £ T H + g T 52£ (3b) 
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subject to: A 6 >i + £ = 0 


(3c) 


The cost in Eq. (3b) includes all the terms in Eq. (2b) combined into a large Hessian matrix and 
gradient vector. The constraints in Eq. (3c) correspond to all of the constraints in Eqs. (2c)-(2e) 
for ail of the stages (including the difference equations), which get lumped into the large A matrix 
and the large £ vector. 

In the variable reduction method, the A matrix in Eq. (3c) is split into [=A , , ^ 2 ] such that 
A x is a square matrix, and all the remaining quantities except £ in Eqs. (3a)-(3c) are split in 
accordance with this: 


find: 

to minimize: 
subject to: 


5^1, 52i 2 


r T TH 

H'l \ H 12 

rw 


- j T1 

52i, 

52i[,52i 2 

- 

1 

CM 

XI 

cO 
J 


9i#2 

8 x 2 


1 , A 


5 *, 

5^2 


+ £ = 0 


(4a) 

(4b) 

(4c) 


If A\ is nonsingular, the variable reduction technique eliminates 5x, and the constraints 
from the problem. The vector bXj is replaced by its constrained value: 


8 &, = - A\\A 2 b^ 1 + cj ( 5 ) 

Substitution of Eq. (5) into Eq. (4b) leaves a cost that depends only upon 62 ^ 2 * and the constraint 
in Eq. (4c) is satisfied for all values of 5x 2 by virtue of Eq. (5). Therefore, unconstrained 
minimization of the resulting cost function can be performed with respect to 5 ^. 

2 . 2 . Sequential Variable Reduction/Partial Solution to Generate the Matrix- 
Riccati Equation. The time-varying matnx Riccati Equation of discrete-time LQR theory is the 
result of a QP solution procedure that is akin to the variable reduction technique. An appropriate 
name would be the sequential variable reduction/partial solution technique. It allows exploitation 



of the special structure of the cost and constraints in Eqs. (2a)-(2e). It involves repeated 
application of a two-stage process. First, it uses some of the constraints to eliminate some of the 
variables, as in the variable reduction technique. Second, it optimizes any variables that become 
unconstrained after the first stage. This optimization process involves writing down the gradients 
of the cost with respect to the unconstrained variables and solving the resulting equations for the 
unconstrained variables. These unconstrained variables are expressed in terms of the remaining 
problem variables. Substitution of this expression for the unconstrained variables back into the 
cost eliminates these variables. 

Assuming no control variable at the terminal stage and no auxiliary equality constraints [no 
Eq. (2e)] the matrix-Riccati technique performs a variable reduction by using the stage N-l 
difference equation to eliminate 5x N from the problem. This results in the stage N- 1/stage N cost 

expression: 


T rs T ~ 

(H x * n _j +F N -iHxxj^N-i) (H xu n ^ +f n-i H xx n g n-i) 

Sxn-1 


(Hxun_i +F N-1^xxn G N-i) T ^uun_i + Gn-1^xxn G N-i) _ 

_5u n 


+ |jgx N . 1 +F N-i(Sx n +H xx n Zn)} T -(Su n . 1 +( Jn-i(£x n +H xx n ^n)} T 


Sx N -i 

L^Un-i 


1 T 


T 


+ J 2nH xx + 


(6) 


The stage N-l control no longer appears in any of the constraints, and it only appears in the 
portion of the cost expressed in Eq. (6). Therefore, Su^.i can be optimized out of the problem. 
Optimization of Eq. (6) with respect to Su^.i yields: 


&UN.! = *(H UUn . 1 + Gn.i H xx n ^N- 1 )' 1 {( H xu N . 1 +F N-1 H xx n Gn-i) ^N-l 

+ Su N _! + Gn.i(£x n +H xx n 2n)) 


(7) 
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Equation (7) is then used to eliminate from Eq. (6). This results in a stage N- 1/stage N cost 
that depends only on Sx^. The resulting formula for the Hessian of that cost is just the discrete- 
time matrix Riccati equation. 

3. Parallel Stage-Halving Algorithm 

The principle of variable reduction followed by partial optimization is also used in the parallel 
factorization algorithms of this paper. The algorithms also include the auxiliary equality 
constraints, Eq. (2e). Two algorithms will be described. The first, described below and in section 
3.2, is for the situation where the number of processors equals the number of stages. Section 3.3 
describes a hybrid between section 3.2’ s stage halving algorithm and a serial backwards sweep 
algorithm. It is useful when the number of problem stages exceeds the number of processors. 

Three main differences distinguish the stage-halving algorithm from the backwards sweeping 
method. First, the auxiliary constraints are used to reduce the number of control variables, if 
possible. Second, QR factorization is used, when possible, to free control variables from 
dependence on difference equation constraints. This allows partial optimization of control 
variables at early stages. Both of these operations can be carried out in parallel. Neither of these 
operations changes the number of stages or the number of state variables at each stage, but they 
both can reduce the number of controls at each stage, and the first operation can reduce the number 
of auxiliary constraints at each stage. 

The third main deviation from the backwards sweep algorithm is that this algorithm follows 
its control vector/constraint reduction with a halving of the number of stages. This is accomplished 
by a standard variable reduction. State vectors 5x i , Sx. 3 , .... 5x N (assume N is odd) are eliminated 

by substitution of state difference equations 0,2,...,N-1 into the cost and the remaining auxiliary 
constraints. This is done in parallel and yields a problem of the same form as Eqs. (2a)-(2e), but 
with half as many stages; each stage corresponds to a doubled discrete-time interval. The sizes of 
the state vectors at the remaining stages are unchanged. The number of control vectors and the 
number of auxiliary constraints per stage both grow, but are bounded -- recall that the operations 



just prior to stage halving both serve to reduce the number ot constraints and controls. In both 
cases, the bound equals twice the dimension ot the state vector. 

The algorithm then repeats its cycle of reducing the number of control vector elements and 
constraints followed by a halving of the number of stages. This goes on until only one stage is 
left. That stage is completely solved, and the result is back-substituted into the preceding 2-stage 
problem and so forth until the entire solution is generated. The time to perform these operations on 
N processors goes as n^log 2 (N), where n is the (average) number of state vector elements. 

The only message passing of the above-outlined algorithm occurs during the stage-halving 
cycle and during the backwards substitution. The ideal architecture for such operations is a binary 
tree. Such a tree can be implemented on a generic hvpercube architecture. The algorithm has been 
implemented on a 32-node hypercube, the INTEL iPSC/2 at the Cornell Theory Center. 

Figure 1 displays the binary tree and the initial distribution of problem stages on 16 
processors (nodes of the tree) for a 16-stage problem. The arrows indicate the message passing 
that occurs during the first stage-number halving. The bottom plot indicates the processor 
locations of the remains of the problem stages after this halving process. 

3.1. Stage-Having Algorithm on an Example Problem. In this section, a simple 
4-stage QP problem is solved using the above algorithmic idea. This should help understand the 
detailed algorithm that is described in the next section. 

The example problem to be solved here is: 


find: 

[u 0 ,x 1 ,u 1 ,x 2 ,U2,x 3 ] T 

2 

(8a) 

to minimize: 

J = y 2 (x k + “k> 

(8b) 


k=0 


subject to: 

x 0 = 1 

(8c) 


x k+t = x k + u k for k = 0,1,2 

(8d) 


x 3 = 4 

(8e) 


where all of the variables are scalar quantities. Table 1 shows this problem in a stage wise manner. 


All 



Table 1. Stages of original example problem. 


Stage 

Dynamic equations(D/E) 

Auxiliary constrainis(A/C) 

Cost 

0 

Xj = 1 + u 0 

none 

T(! + Uq) 

1 

x 2 = X, + U| 

none 

1 , 2 ^ 2, 

T (X 1 +U 1 } 

2 

x 3 = x 2 + u 2 

none 

+ Oj) 

3 

none 

x 3 =4 

none 


The first step in this example is to halve the number of stages by joining stages in pairs using 
the dynamic equations, i.e„ Xj is eliminated from the problem using the dynamic equation for 
stage 0, and x 3 is eliminated from the problem using the dynamic equation for stage 2. Table 2 
shows what the problem becomes after this operation. Stage O' in Table 2 represents the (0,1 } 
stage pair of the original problem and stage 1' represents the (2,3) pair. The new variables in 
Table 2 are defined as follows: 


UO' = 


uo 

, u i} 


X;' = X 2 , Uj' - Ui 


( 9 ) 


Table 2. Example problem after stage halving. 


Stage 


D/E 


A/C 


Cost 


O’ 

r 


X r = 1 + [1, 1] Uq. 


none 


1 T 
TMo- 


none 


•xr + u r = 4 


2 0 
0 1 

l 


Uo' + f 1 , 0] Uq- + 1 


+ U P 
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The next step is to remove u r from the problem using the auxiliary equation, x r + u r = 4, 
which yields Uj* = 4 - x 3 *. This expression is used to eliminate u ; ■ from the cost at stage 1 . Table 

3 shows the resulting problem. 


Table 3. Example problem after elimination of auxiliary constraints. 


Stage 

D/E 

A/C 

Cost 

0’ 

x r = 1 + [1, 1] Uq- 

none 

, t r 2 oi 

2 no' l^o ijiio' + n»o] uq' + 1 

r 

none 

none 

xj, • 4 x r t 8 


The next step is to QR factorize the [1, 1] matrix in the dynamic equation of stage O'. This 
right QR factorization transforms uxr so that one of the transformed control vector elements does 

not enter the dynamic equation. 


[1, 1]Q = [V2, 01 


Q 


%• 

%• 


= no- 


where Q = —- 

\2 _ 


, an orthogonal matrix. 


(10a) 

(10b) 


where u 3() , and u 4() , are the elements of the transformed control vector. Table 4 shows the resulting 
problem. 


Table 4. Example problem after orthogonal transformation of control vector. 


Stage D/E A/C Cost 


O' x r =l+V2 u 3() , 
1 ' none 


none 


none 


l[ U % 


U 4r 


,r 3 1 1 

' u 3 0 .' 



1 1 3 . 


+ -=[1, 1] 

v2 



+ 1 


xp -4x r + 8 
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The next step eliminates u 4q , from the problem. Since it does not enter the constraints, the 
cost can be optimized with respect to u 4q ,. By setting to zero the partial derivative ot the cost with 
respect to u 4q , , one can derive the formula 


l vT 
%• ' 3 %• * 3 


( 11 ) 


Equation (11) can be used to eliminate u 4q , from the problem. The resulting problem appears in 
Table 5. 


Table 5. Example problem after partial optimization of controls. 


Stage 

D/E 

A/C 

Cost 

O' 

x r = 1 + v2 u 3o , 

none 

2 2 vT . 5 

3%' + T%' + 6 

r 

none 

none 

Xy - 4 x r + 8 


The next step joins stage O’ and 1\ eliminating x r from the problem using the dynamic 
equation from stage O'. At this point, the original constrained 4-stage problem has been reduced to 
the following unconstrained 1 -stage problem: 

find: Ui_. to minimize J = - u 4 „, - -r- ui n , + — (12) 

3 w U j 0 o 

The solution is easily found, and it is u^„, = . 

'0 I 0 

Back Substitution 

Once U 3 Q , is known, the optimal values of the original variables in (8a) are found by back 
substitution. For example, u 0 is found as follows: first, u 4 is computed by Eq. (11), and u 4() , 






together with U 3 Q1 yields Ho' by Eq. (10b), u ( ) is just the lirst entry of Uo' from Eq. (9). It is Uq - 
-1/8 in this case. Other variables are found in a similar way. 

3.2. Details of General Algorithm. 

Steps and Detailed Operations for One Staee-Halving Cycle 

Step 1 . Completely QR factorize the auxiliary constraint matrix (in parallel): 


Ql k B k Q 2 k 


r R i k °i 

. 0 0J 


(13) 


where Qj and Q 2|c are orthogonal and Rj is square and upper triangular. 


This effectively transforms the control coordinates and the constraint equations, splitting the 
controls into those that are in the auxiliary constraints and those that are not, and splitting the 
constraints into those that involve controls and those that do not. Note, the matrix on the right may 
not always take the most general form shown above. The factorization effort can be reduced by 
taking advantage of any rows and columns of B k that are already zero. 

This factorization necessitates a reformulation of other problem matrices to account tor the 
reformulated control variables and constraints: 


5ui 


k 


Su 2 


kj 



(14a) 


A l, 

A 2i 


-Ql v A k> 


k 

k " lz k 
H 12 v H 22 


£2, 


= Ql k £k. [ G l k - G 2 k ]= G kQ2 k 


H ll t H 12 
T 


= Q2 k H„ Ufc Q 2k , [H„ k ,H x2k ] = H XUk Q 2k , 
k J 


*1| 

S2v 


= Q^ k 


(14b) 

(14c) 
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The subscripts 1 and 2 on the new A matrices and c vectors refer to the transformed constraints; the 
1 subscript corresponds to constraints that depend on controls, and the 2 subscript refers to 
constraints that are independent of controls. The 1 and 2 subscripts on the H and G matrices and 
on the g vectors refer to the index of the transformed control vector that gets multiplied by these 
matrices and vectors in the cost and in the state difference equations. 

Step 2. Use the first auxiliary constraint to solve for 5uj in terms of 8x k , and 
eliminate the first auxiliary constraint and 8u.i k from the problem (in 

parallel): 

5m k = -«!'*[ A, k 5x k+ c, k l (15) 


This takes the form of a feedback control law. This is the necessary feedback to stay on the 
( )j transformed constraint. After elimination of Suj and the ( )• auxiliary constraint, the 

k k k 

problem takes the form 


find: 


&£ = [ Su2 0 ,5x{,5uT t ,5x T Su ( ,5xJ,5uy 


to minimize: J = 


N 

I 

k=0 


1 r e T, T 


- | 5x. ,5u. 


-k 


fl <2 k H 22k ] 


Sxk 

5u 2 . 


- jr _t " 

5x k 


_5u 2k _ 


+ constant 


subject to: Sxq given 

Sx^i = ^k 5xk + G 2k 0 u. 2 k + z k for k = 0 ... N-l 

A 2k Sx k + c 2fc = 0 for k = 0 ... N 


( 16a) 


(16b) 

(16c) 

(16d) 

(16e) 
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which has the same form as the problem in Eqs. (2a)-(2e). The matrices and vectors with the (~) 
overstrike are derived by substitution of equations (14a)-(14c) and (15) into problem (2a)-(2e). 
They are 


f^xx k — ^xx k 


H, 1 k Rl k A, k " A,/ + [R-,‘ A,„| T H, lt R ', 1 A, 


flx2 k = H x 2 k .[R 1 1 t A lt J T H l2k 

Ix t = Sx k - fH,, k -[Ri 1 k A I /H llk )R- 1 1 k £ lt -[R- 1 1 k A l /ei k 
S 2 k = -S 2 k -H’f 2 k R 'l 1 k £l k 

?k = F k ‘ G l k R l k A l k ’ ^-k = -k ' G l k R l k £l k 


(17a) 

(17b) 

(17c) 

(17d) 

(17e) 


Step 3 . Right QR factorize G 2k to eliminate extra controls from the state difference 


equation (in parallel). 


G 2k Q3 k = [L 3k . 0] U8) 

where Q 3 is orthogonal and L 3 is square and lower triangular. 

K K 

This step can be done only if G 2k has more columns than rows. Such may not be the case 

initially, but after several halvings of the number of stages, this will generally be true. Again, this 
factorization necessitates a reformulation of other problem matrices to account for the reformulated 
control variables: 


'Su3 k 

8ll4 


-OLSIK, 


kj 

H 33 k H 34k 
H 34 . H 44 


k J 


= Qf k H 22k Q3 k . [H x3k .Hx4 k ]=Rx2 k Q3 


k’ 


£ 3 k 

S4k 


(19a) 


= Q( k l2 k (19b) 
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The 3 and 4 subscripts on the H matrices and on the g vectors refer to the index of the transformed 
control vector that gets multiplied by these matrices and vectors in the cost. 

This step is one of the most costly parts of the whole algorithm due to the large size of the Q 3 

matrix — it is usually 2n x 2n where n is the state space dimension. The operation time can be 
reduced by exploiting the special structure of the Go matrix — after several stage-halvings, the 
rightmost part of G 2 is already lower-tnangular -- and by working with individual Householder 
vectors rather than explicitly forming the Q 3 matrix. 

Step 4 . Partially optimize by computing the optimal 5iu k as a function of Sy .3 k and 
Sx^ (in parallel for all stages where step 3 has been earned out). Solve the 
equation 


H44 k 5a4 k - -IH*^ + + g4 k ] (20) 

then eliminate U 4 k from the cost function by substituting this solution for it. 

T 

This equation can be solved using a Choleskv factorization of H 44 = R 4 R 4 and back 

K k R 

substitution; R 4 is upper triangular. This implies that the H 44 matnx is positive definite. 

k k 

If H 44 k is not positive definite, then the search-step problem [Eqs. (2a)-(2e)] is ill-defined; it has 
an infinite minimum. 

An important feature of this algorithm is that it can be used to detect whether or not the 
projected Hessian is indefinite. This is done during the Choleskv factorization of H 44k - In the 

event of an indefinite H 44k> the Choleskv factorization procedure can be modified to determine a 

diagonal modification to H 44fc which ensures that the solution of the (modified) Eqs. (2a)-(2e) is a 

descent direction of the original problem in Eqs. (la)-(le). Alternatively, the modified Cholesky 



factorization procedure can be used to determine a direction of negative curvature (Ref. 14, pp. 

108-111), which could be used as a search direction. 

The solution of Eq. (20) looks somewhat like a feedback control law, in that du^ depends 

on Sx*, but the "control law" is peculiar in that 5u4 k also depends on 5u 3k . Upon substitution of 
the solution into the cost, the problem becomes 


find: 


$ 2 £ = 



to minimize: 


subject to: 







-_T _T ‘ 


Sx k -S.3 k 

> k _ 


+ constant } 


Sxo given 

5xk + i = F k 5x k + L 3k 5u 3k + z k for k = 0 ... N-l 


A 2)c 5x k + c 2k = 0 


for k = 0 ... N 


(21a) 


(21b) 

(21c) 
(2 Id) 
(21e) 


which has the same form as the problem in Eqs. (2a)-(2e). Note that there is no longer a control at 
the terminal stage -- 5 u 3n is a vector of zero dimension. The matrices and vectors with the (-) 

overstrike are derived by substitution of the solution of Eq. (20) and Eqs. (19a)-( 19b) into problem 
(16a)-(16e). They are 


" [Hx4 k ^4 k ll^x4 k ^4 k l 

(22a) 

H,j k = H 53k - [H x4k R 4 1 k l[H 34k R 4 l k l T 

(22b) 
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( 22 c) 


H33 k = H 3 3 k -[H34 k R; 1 k ][H 34 l R 4 ‘ k l T 
Sx k = ?x k ' [H k 4 k R4 k ][Rl T k £4 k ] l 22d > 

S3 k = S3 k -[H 34 k R 4 k ][R4 T k *4 k l « 2 => 


Step 5 . Halve the number of stages by using the difference equations [Eq. (21d)] 
for stages k = 0, 2, 4,..., N-l to eliminate state vectors 5x lk 8 x 3 , 8 x 5 ,..., 
8 xn from the cost and from the auxiliary constraints. Assume that N is odd 
(N = 2M+1). This is done in parallel and includes parallel message passing 
between M+ 1 pairs of processors. 


This is the concluding step in the process of halving the number of problem stages. After 
this step the problem is back in the form of Eqs. (2a)-(2e) only with half as many stages -- M+l 
stages instead of N+ 1 : 


find: 


to minimize: 


subject to: 


5 * pAT oAT oAT oAT £ A T £ A T c? A T 

u2i “ &U qjUX | ,GU. | fix 2 1- •• *0 U_ ^ | ,0?^ M ,0U j^j 


M 


J = 


k=0 


Sxq given 

A 


I 


1 r*AT«A ti 


tv 


pA 1 P A 

OX i , u 11 1 


A A 

^XU k 
A T A 

ffxu k H UUk 


+ constant} 


S£k + i = F k 5ik + G k + Ik 

A k Sx^ + B k 5^ + c k = 0 


for k = 0 ... M-l 
for k = 0 ... M 


(23a) 


px k 


1 

> 

H 

> 

H 

'5x k ' 

.8u kJ 

+ 

Sx k >^u k 

.Silk. 


(23b) 

(23c) 

(23d) 

(23e) 


where the vectors and matrices with the ( A ) overstrike are derived from substitution of alternate 
difference equations into Eqs. (21a)-(21e): 


I 
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5tL = 


6U3 


2k 


5 «2krlJ 


for k = 0 M-l, 5 u m = 5u.3 OM , Sx^ = 5x 2k for k - 0 


F k “ F 2k+l F 2k» Gk - 


F 2k+l L 3, t - L 3 


2k’ 0 2k-t-l 


> Ik = t F 2k+1^2k + I2k+J 


A 

Ak = 


A 

B t = 


A 2 2k+l F 2k 

. A2 2k J 

A2 2k+l L3 2k ® 
0 0 


A 


^k+l^k + —2 


2k+l 


£2 


2k 


fork = 0,...,M-1, B m = 


for k = 
for k = 0 M 

A2 2M+l^ 3 2M 
0 


^xx k “ ^xx2k + F 2k ^ xx 2k+l F 2k ^ or ^ 


H xu k = 


t H x3 2k + F 2k Hx x 2k+1 L 3 2k ^ t F 2k H x3 2k+ J 


for k = 


A — - j — 

^xum = [Hx3->m + F 2M 


5 2M 2M xx 2M+1 ^2M j 


H u „ k = 


l H 3 3 2k + L 32k H ’ x :k T l L? ' ;k' | L ’2k H,3 ’-k»l' 




H 3 3 


2k+l -1 


for k = 


A — J — 

H uu m = f H 33 2M + L 3 2 M Hxx 2M+l L3 2M^ 


ix k = Sx 2k + F 2 I [£ X2k+l + H XX2k+l z 2k ] for k = 


Su k " 


S3 2k + L 3 2k [ ^x 2k+1 + H xx 2k+1 l2 k ] 


-T 

^ 3 2k+ 1 + Hx3 2k+r 2k 


for k = 


^ 3 2M + L 3 2M tSx 2M+1 + H xx 2M+1 ^2m] 


.,M 

(24a) 

(24b) 

(24c) 

(24d) 

(24e) 

(24f) 

(24g) 

. (24h) 

(24i) 

(24j) 

(24k) 

(24m) 
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This step is the only step in the entire stage-number halving cycle that involves message 
passing. The quantities that get computed in equations (24b)-(24m) must all reside on a single 
processor at the end of this step, one processor for each index k. At the beginning of this step, 
however, the information associated with index 2k resides on one processor, while the information 
associated with index 2k+l resides on another processor. The binary tree structure of Fig. 1 
ensures that all such pairs of processors will have a direct connection. Thus, the M+l messages 
that must get passed can get passed in parallel. Repeated application of this 5-step stage-number 
halving procedure results in information fan-in on the binary tree — information moves up the tree 
in Fig. 1. 

Nesting and Back Substitution 

The foregoing 5-step process can be repeated until the entire problem is solved: the halved 
problem is again halved in a nested cycle. Thus, the hatted vectors and matrices at the end of a 
cycle become the un-hatted vectors and matrices at the beginning of the next cycle — note the exact 
replication of the problem (2a)-(2e) form in problem (23a)-(23e). Suppose that, initially, N = 21-1. 
That is, suppose the problem starts with V stages. After j times through the halving cycle, the 

problem will have only one stage, stage 0. During subsequent application, the cycle will terminate 
at step 4. The dimension of 5u3 0 will be zero, and the state, Sxq. is known: so, Su4 0 is completely 

determined. 

Back substitution is used to reconstruct the entire state and control time history after the 

problem halving sequence has yielded a problem with a single stage. The back substitution 

process is the reverse of the problem halving process, it is a stage-number doubling process. It 

involves repeated application of a two-step process. 

Given a certain level of problem halving, and given the 8x k and Su3 k vectors at each stage 

corresponding to that level, the first step is to determine the Suk vector at each stage for that level. 
This can be done in parallel by successive application of Eqs. (20) [to determine Su^J, (19a) [to 
determine St&J, (15) [to determine Suj j, and (14a) [to determine 811*1. Knowledge of the 8u k 

vector at each stage for the current level of halving translates into knowledge of the 
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5113-,^ ^nd 8113^ i vectors at eac h pttir °f stages for the next previous level, the level with twice as 
many stages [see Eq. (24a)]. 

The second step is to determine the 5x.|< vector at each stage tor the next previous level of 

halving. The state time history at the current level already contains half ot the desired vectors [Eq. 

(24a)]. The unknown state vectors at the alternate stages are determined in parallel from the state 

difference equations at alternate stages of the previous problem level [Eq. (2 Id) for k = 0, 2, 4, ..., 
N-l]. This can be done because each stage s 8u3 k vector at this previous level has already been 

determined. 

Message passing occurs during this back substitution procedure. The pattern of the 
information flow is a fan-out along a binary tree (traveling downward on Fig. 1). Atter each Su^ 

has been calculated at a given level — each stage on a different processor, it is broken into its two 
components, 6113^1 ^H32k+i’ an( ^ one com P onents * s sent t0 a neighboring processor. The 

other component remains on the current processor, which will handle the corresponding stage at 
the next level of doubling. 

3.3. Hybrid Multi-Stage-per-Processor Parallel QP Algorithm. The algorithm 

in the foregoing section requires as many processors as the number of problem stages. A small 
additional procedure enables the algorithm to handle problems with more stages than the given 
number of processors. Two benefits come from the improvement. First, it allows solution of a 
greater range of problems on a given hypercube, a 32-node hypercube in this research. Second, 
for typical problems, significant problem speed-up can be achieved even with more than one 
problem stage per processor. For a given number of processors, the speed-up over the best 
known one-processor algorithm grows with the number of stages in the problem. 

The working of the multi-stage-per-processor algorithm is slightly more complex than the 
single-stage-per-processor algorithm. When a single processor has more than one stage, the 
algorithm starts by working only with the last stage of this set of stages -- it must have a 
contiguous set of problem stages. It starts by performing steps 1-4 of section 3.2 on this last 
stage. Then it performs step 5 to join the last stage on the processor to the second-to-last stage. At 
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the end of this cycle, each processor has one less stage. Figure 2 shows how a 24-stage problem 
on 8 processors becomes a 16-stage problem on these same 8 processors. This goes on until each 
processor has only one stage. At this point, the algorithm can do stage halving in the manner 
presented in section 3.2. Effectively, a backwards sweep is performed on each processor before 
the stage halving process begins. Thus, this algorithm is termed a hybrid algorithm. 

The back substitution process described in the foregoing section undergoes a similar 
modification. It starts with the algorithm's original stage-doubling back substitution process. This 
terminates when each processor has received a control vector and a state vector corresponding to 
the first of its set of stages. A forward sweep of the state equation and the "feedback" control 
equations is then performed. This sweep extends to all of the stages on the given processor. 

3.4. Bench-Mark (Best Known) Serial QP Algorithm. In order to determine the 
benefits due to parallelism, a good serial algorithm is needed for comparison purposes. To make 
an honest comparison, the bench-mark algorithm has to solve the same problem as the parallel 
algorithm in the fastest possible serial wav. The fastest known serial QP algorithms for Dynamic 
Quadratic programs are those based upon the backwards sweep — their serial execution time is 
O(N), where N is the number of discrete-time problem stages. Unfortunately, the standard 
backwards-sweep algorithm, the discrete-time matrix Riccati equation, does not handle auxiliary 
equality constraints. 

The principles in section 3.2 have been used to develop a backwards sweeping algorithm 
that overcomes these difficulties. The algorithm starts with the last problem stage. It applies steps 
1, 2, and 4 of section 3.2 to this stage. Step 3 need not be performed because there is no state 
difference equation for propagation to a next stage. Step 5 is then performed to just the last two 
stages. The result is a problem with just one less stage. This process is repeated until the first 
problem stage is reached. The resulting one-stage problem is completely solved. The entire state 
and control time histories are then reconstructed using a forward sweep. 

As a further complication, assume that derivatives are calculated in parallel as in Ref. 12. 
Then this bench-mark serial algorithm must operate with gradients, Hessians, and Jacobians that 
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are distributed on different processors. If this serial algorithm runs on a single processor, the 
communication time to transfer the distributed QP information onto the single processor would 
greatly slow the algorithm. 

A better way to use the serial algorithm is to have the backwards sweep transferred from one 
processor to the next as it needs information for a different stage. For the case when there are 
more stages than the number of available processors, each processor deals with a set of contiguous 
stages whose gradients, Hessians, and Jacobian have been calculated on it. Figure 3 depicts this 
backwards sweep as it transfers between processors that have derivative information about 
different sets of problem stages. In this case, there are more problem stages than nodes. On a 
given node, the serial algorithm sweeps backwards through ail of the stages (arrows between 
numbers). After sweeping through all of the stages on a given node, it transters to the node with 
the next previous stage (arrows between nodes). 

4. Computational Test and Results 

4.1. Aeromaneuvering Test Problem. In order to test the algorithm, a specific 
nonlinear trajectory optimization problem and a linear-quadraticization of the problem about a 
guessed solution are needed. An aeromaneuvering problem from Miele and Lee (Ret. 15) has been 
used. 

The maneuver involves transfer from a geosynchronous Earth orbit (GEO) of 0° inclination 
to a circular low Earth orbit of + 28° inclination (LEO) (see Fig. 4). The maneuver that Miele and 
Lee call type 2 has been used: three propulsive impulses with plane changes and one atmospheric 
maneuvering arc. The first impulse (point 1 in Fig. 4) is nontangential; it changes the inclination, 
and it puts the spacecraft (S/C) into an elliptical transfer orbit that takes it toward the Earth. Next, 
the S/C enters the upper atmosphere 4 (point 2) and performs a deceleration and further plane 
change using aerodynamic drag and lift. No thrusting is used during this phase. Next, it exits the 

4 Atmospheric entry and exit are defined as occurring at an altitude of 120 km. 
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atmosphere (point 3), simultaneously performing its second nontangential propulsive bum. The 
maneuver ends when the S/C reaches the correct LEO altitude (point 4) and performs its final 
nontangential bum to circularize the orbit and achieve the correct final inclination. The LEO 
altitude is set at 300 km. 

State propagation has been performed in two ways. Outside of the atmosphere, the state has 
been propagated using Kepler s laws coupled with the transformations between Kepler’s elements 
and the state vector elements. Inside the atmosphere, Miele and Lee's simple models of the lift and 
drag have been used along with a table look-up of the atmospheric density (Ref. 16, p.820) to get 
the aerodynamic forces. The gravitational force is based on the spherical Earth model. The 
resulting system involves six coupled scalar ordinary differential equations — three dynamics 
equations and three kinematics equations. Inside the atmosphere, 6th-order Runge-Kutta 
numerical integration of these ordinary differential equations from time t k to time t k+1 effectively 
defines the function f(x k ,ii k ,k) on the right hand side of the system's nonlinear difference equation 
[Eq. (Id), the discrete-time system dynamics]. The control is held fixed at u^ during this 
integration, which gives a zero-order hold. The state vector at all stages is: 


x = 


-v- 

Y 

¥ 


<t> 


L0J 


(25) 


where V is inertial speed, y is flight path angle, \j / is heading angle (0° = east, +90° = north), r is 
radial distance to Earth's center, <J> is latitude (positive north), and 0 is longitude (positive east). 
This is the same state vector as in Miele and Lee (Ref. 15), where the equations of motion are 
presented. 

The definitions and lengths of the control vector and constraint vector vary with different 
problem discrete-time steps. The cost function also varies with time step. This is necessary for 
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efficient modeling of the problem. Suppose the problem is modeled by 2- 1 (— N+l) steps, labeled 
0, N. Table 6 defines the steps, the controls, the constraints, and the cost. Note that the time 
duration of step zero has been automatically adjusted to ensure that the S/C altitude is 120 km at the 
end of the step. Similarly, V c and y c have been eliminated from step N by requiring them to yield a 
circular orbit. The controls during the atmospheric portion are Cl, the lift coefficient, a the bank 
angle, and x, the actual time duration of the discrete-time step. The control Av atter the second 
impulsive bum is the change in the true anomaly until the third bum (the coast distance). 
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Table 6. Activities, controls, constraints, and cost modeling at different time 

steps of the aeromaneuvering example. 


Step No. 

Portion of Flight 

Control vector 

Constraints 

Cost 

0 

First bum and coast to 
upper atmosphere 

(VoYoVc) 5 

Perigee altitude of 

transfer orbit less 
than atmosphere 
height. 

Magnitude 

of AV 
impulse 6 

1 

Nonthrusting atmos- 
pheric flight with control- 
led lift and drag. 

(C L ,a,T) 

x > 0.25 sec; 

-0.9 < C L < 0.9; 
Heating rate < 

lxl 0 6 watts/rrr. 

0 

N-2 

4 1 

•• 

PI 

» 

N-l 

Second bum at atmos- 
phere exit and coast to 
LEO altitude 

(VcYc.Vc^v) 

r = atmosphere 
edge 

= rE + 120km; 
Y > 0; 

Yc > 0. 

Magnitude 

of AV 
impulse 

N 

Third bum to circular- 
ize and correct inclina- 
tion at LEO altitude. 

Vc 

r = LEO radius 
= r£ + 300km; 
Inclination = 28° 

Magnitude 
of AV 
impulse 


5 Defines the velocity vector after the impulsive bum. 

6 This definition is used by Miele and Lee and reflects fuel use. Total fuel use is a monotonic 
function of the sum over all stages of the impulsive velocity changes. 
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A guessed optimal solution is needed in order to obtain a linear-quadratic problem for 
solution by the parallel QP solver. This guessed solution has been generated as follows. The 
guessed solution makes a first bum to achieve the correct inclination and to reduce the perigee 
altitude to 60 km. The guess then follows this transfer orbit to perigee neglecting atmospheric 
effects. It lumps all of the aerodynamic forces into an assumed velocity reduction at perigee that 
lowers the apogee aldtude to 300 km. It then propagates this orbit to apogee where it circularizes it 
with a bum. It does no bum at atmospheric exit. All but stages 0, N-l, and N have been assumed 
to occur within the atmosphere during later linear-quadraticization. 

The controls for the bum/coast arcs are well defined in the first-guess procedure. The 
controls for the atmospheric portions have been guessed to be Cl — 05, <J = 45°, and t = time to 
traverse a fixed change in true anomaly along the guessed trajectory. 

The guessed active constraints include those inequalities that are violated by the guessed 
solution. All of the equality constraints - both the auxiliary constraints listed in the above table 
and the state difference equation equality constraints -- have been considered active. The 
multipliers associated with each active constraint have been guessed to be a positive constant times 
the signed constraint violation: this is akin to the penalty function approach. 

The cost gradients. Hessians of the Hamiltonians, and Jacobians of the active constraints are 
needed by the parallel QP solver. All of these have been generated numerically by finite 
differencing. In addition, a positive multiple of the identity matrix has been added to the Hessian 
before QP solution. This assures a positive definite problem for the QP procedure as would occur 
if this search step procedure were pan of a nonlinear programming algorithm that used the L 2 trust- 
region method. The authors do not intend to use the L^-trust region method when they eventually 
develop the full nonlinear optimization algorithm. Rather, the addition of a multiple of the identity 
matrix to the Hessian has been done because the Hessian modification capability, to which section 
3.2 alludes, has not been implemented yet. 

All of the calculations that produced the necessary QP have been done off-line on a serial 
processor, an IBM PC-XT. The results have been loaded onto the iPSC/2 nodes to test the parallel 
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QP algorithm. The timing results presented below do not reflect the times for off-line generation 
and node loading. In the future, these operations will also be done in parallel on the iPSC/2 nodes, 
but such complexity has not been necessary to the testing of the parallel QP algorithm. 

4.2. Computational Timing Results. The parallel and serial QP algorithms have been 
used to solve 8, 16, 32, 64, 128, 256, 512 and 1024 stage cases of the aero-maneuvering test 
problem. With the parallel algorithm, different numbers of processors (8, 16 and 32) have been 
used to solve the same problem. The serial algorithm of section 3.4 has been swept over 8 
processors for the 8 stage problem, 16 processors for the 16 stage problem, and 32 processors for 
all problems with 32 or more stages. Figure 5 gives timing results for both algorithms with the 
various combinations of processors and numbers of problem stages mentioned above. 

For the parallel algorithm, although the difference is minute, the 16 stage problem is solved 
faster with 8 processors than with 16 processors; the 32 stage problem is solved faster with 16 
processors than with 32 processors. This means that having 2 stages per node initially and one 
less overall stage-halving cycle is better than starting with a single stage per node. Although this 
improvement might not occur for different problems, a saving of half as many nodes as the number 
of stages could be a significant economy for problems with many stages. 

The speed-up of the parallel algorithm with 32 processors over the serial algorithm is shown 
in Fig. 6. For the serial algorithm, a problem with twice as many stages takes twice the time (see 
Fig. 5). For the parallel algorithm, as the number of stages becomes large, the early work of 
backwards sweeping through multiple stages on a single processor dominates the computation 
time; the amount of work doubles as the number of stages doubles. Thus, the curve in Fig. 6 will 
eventually level out as the number of problem stages increases. The slope of the curve is already 
decreasing after 256 stages. Note that the speed-up reported on Fig. 6 would be slightly less if the 
serial algorithm could work with information that was already concentrated on a single processor. 

The achieved algorithm speed-up is significant. Figure 6 shows an 8-fold speed increase for 
a 1024-stage problem when solved on 32 processors. In considering this speed-up, recall that the 
parallel algorithm is being compared to the best known serial algorithm. This is the most 
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conservative way of determining speed-up. A less conservative measure ot speed-up due to 
parallelism would compare the parallel algorithm executed in parallel with itself executed serially. 
In this case, the speed-up factor would be 29, a 91% efficiency, on the 1024-stage problem. 

Efficiency aside, an important characteristic of this algorithm is its ability to reduce wail- 
clock time for problems with a large number of stages. The lowest curve on Fig. 5 illustrates this 
ability to maintain low wall-clock time by increasing the number of processors as the number of 
problem stages grows. 

The only other known parallel algorithm for similar problems is that developed by Wright 
(Ref. 13). Wright's algorithm scales with problem size the same way that the present algorithm 
scales with problem size. Wright s algorithm probably is faster than the present algorithm by a 
scale factor that is independent of problem size. This is because it is a simpler algorithm: it 
"divides" in the same way, but it "conquers" each sub-problem more efficiently. Wright's 
algorithm is probably about 4 times faster than the present algorithm on problems that can be 
solved by both algorithms. Unfortunately, Wright's algorithm cannot handle auxiliary state 
constraints; therefore, it could not be used to solve this example problem to provide a comparison 
between the two parallel algorithms. 

5. Additional Discussion of Algorithm 

5.1. Relationship of Stage-Halving to Control Concepts. The stage-halving 
algorithm of section 3.2 can be further illuminated by considering it in the control context. 
Suppose that the states at certain points of time before the end time are totally prescribed. In this 
case, the original optimal control problem divides neady into independent sub-problems, which can 
be solved in parallel. After these are solved, the original cost of the total problem can be 
computed. This cost will depend on the values prescribed for the states at the interim points where 
the states are fixed. These states can then be varied to optimize the total cost. The resulting 
solution will solve the original problem. This idea, coupled with the idea ot algorithm nesting, 
yields the approach of section 3.2. This concept is also found in Ref. 11. 
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5.2. Bottleneck Performance. In general, a parallel algorithm can be subject to 
bottlenecks. For instance, suppose one stage has more auxiliary constraints to factorize than does 
another stage. This wiil delay the other stage at some point by making it wait to combine (step 5) 
with a slower executing stage. One might conjecture that such effects get compounded to slow the 
overall stage-halving algorithm. 

This issue has been investigated. The time to execute each step of the algorithm on each 
processor has been determined for example problems. These results show that delays occur, but 
they do not get compounded. The total execution time of the algorithm is no slower than the sum 
over all of the stage halvings of the execution time of the slowest stage at each levei of stage 
halving. 

5.3. Compatibility with Parallel Gradient, Jacobian, and Hessian 
Calculations. The algorithm of section 3 brings an additional benefit (beyond its real-time 
speed-up factor) to the solution of nonlinear trajectory optimization problems. It is well suited for 
use in conjunction with parallel gradient and Jacobian calculation. References 12 and 13 both point 
out the possible advantage of calculating the gradients of the cost function and the Jacobians of the 
constraints in parallel. This also turns out to be a stage-wise parallelization. After calculation, 
these matrices and vectors get used by the current algorithm in the same distributed manner as the 
manner in which they are generated -- no additional message passing is needed. This is the reason 
that the bench-mark serial algorithm has been constrained to use distributed derivative information. 

5.4. Numerical Stability. One point of caution concerns the numerical stability of the 
algorithm. If the matrices for k = 0, 1, 2, N-l are large compared to 1 (if the open-loop 

system is wildly unstable), then numerical problems can occur. The algorithm includes repeated 
multiplication by these matrices. Other trajectory optimization algorithms seem to suffer from this 

same drawback. The authors do not know of any examples where this becomes a problem. 

Alternatively, if A \ in Eq. (15) is very large compared to Rj k> then numerical instability can 

occur. This can happen when the overall nonlinear programming algorithm, of which this 



algorithm is a part, allows constraint violations in the torm of slack variables that later get penalized 
in the cost. 

If either of these problems occur, then steps 1, 2, and 5 can be altered to yield a parallel 
version of the standard null space method. Such an algorithm would be guaranteed to have 
numerical stability. This alteration would result in a slower algorithm, but its execution time would 
still scale as n 3 log 2 N on N processors. 


6. Conclusions 

An algorithm has been presented that can solve an equality-constrained time-varying discrete- 
time LQR problem. It uses a hvpercube message-passing parallel processor to solve the problem 
in 0(n 3 log 2 N) operations on N processors, where N is the number of problem stages and n is the 

number of state vector elements. It can also solve problems with more stages than the number of 
processors -- it solves an N stage problem on p processors in 0(n 3 {[N/p]+log2p}) operations. It 

uses a stage-wise parallelization, divide and conquer approach. It is a specialized form of a 
technique known as domain decomposition. It can handle auxiliary equality constraints. Such 
auxiliary constraints would arise from state or control equality or inequality constraints if the 
present algorithm were used to calculate the search direction as part of an active-set nonlinear 
trajectory optimization algorithm. 

In order to test the algorithm, an aeromaneuvering problem has been solved. The problem, 
being nonlinear, has been linearized about a guessed solution. A good serial algorithm that can 
handle auxiliary equality constraints also has been developed for fair evaluation ot the parallel 
algorithm's speed. A significant speed-up of the parallel algorithm over the serial algorithm has 
been achieved - a factor of 8 for a 1024-stage problem on 32 processors. With more processors 
available, a higher speed-up should be possible. 

The algorithm has been developed as part of an approach to the problem of doing nonlinear 
trajectory optimization on line. It reduces the wall clock time that a trajectory optimization 
algorithm must spend to solve for a search direction. 
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Abstract. An algorithm has been developed to solve quadratic programs that have a dynamic 
programming structure. It is being developed for use as part of a parallel trajectory optimization 
algorithm. The dynamic quadratic programming algorithm has been developed to achieve 
significant speed without sacrificing numerical stability. The algorithm makes use of the dynamic 
programming problem structure and the domain decomposition approach. It parallelizes the 
orthogonal factorization null-space method of quadratic programming by developing a parallel 
orthogonal factorization and a parallel Cholesky factorization. Tests of the algorithm on a 32-node 
INTEL iPSC/2 hypercube demonstrate speed-up factors as large as 10 in comparison to the fastest 
known equivalent serial algorithm. 


Key Words. Quadratic programming, dynamic programming, orthogonal factorization, parallel 
algorithms, domain decomposition. 



1. Introduction 


The objective of this paper is to develop and test a parallel algorithm for solving equality- 
constrained quadratic programs that have a dynamic programming problem structure. The 
algorithm must be rapid, it must have good numerical stability, and it must be able to detect an 
indefinite projected Hessian. Also, the algorithm must be able to determine a feasible direction of 

negative curvature when the projected Hessian is indefinite. 

This algorithm has been designed to be part of a sequential quadratic programming-type 
nonlinear trajectory optimization algorithm (Refs. 1 and 2). The dynamic quadratic programming 
(DQP) algorithm must execute very rapidly because the parent nonlinear trajectory optimization 
algorithm may be used to provide real-time guidance commands to an aerospace vehicle. 

The nonlinear programming (NP) core of the parent trajectory optimization algorithm requires 
the solution of an equality-constrained quadratic program (QP) each time it generates a search 
direction (Ref. 2). The characteristics of the core NP algorithm dictate the required good numerical 
stability of this paper's QP algorithm. A general variable reduction null-space method may lead to 
extreme ill-conditioning when used in conjunction with the core NP algorithm of Ref. 2, but an 
orthogonal factorization null-space method will have satisfactory numerical stability. The NP core 
also dictates the required ability to detect and determine feasible directions of negative curvature 


when they exist. 

A general form of a dynamic quadratic program is 
find: 

N 


to minimize: J = {^H k x k + gl*k) 


k=0 


(la) 

(lb) 


subject to: E k x k + e k + F k+ i x k+ i + f k +i - 0 f° r k 0 ••• ^ c 

D k x k + d k = 0 for k = 0 ... N (Id) 

The index k refers to the stage or time-step number. There are a total of N stages. The large 
solution vector, X, is a composition of components from each stage, Xo, Xi, x 2 , ..., and Xn- 
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/he dimension of the solution vector x^ at a given stage is n^. This dimension may vary from 
stage to stage. In an optimal control problem, the vector Xk will be a composite of the "state" and 
"control" vectors at time-step k. 

The cost function J in Eq. (lb) is a summation of stage- wise costs, each of which includes 
both linear and quadratic terms. The cost is de-coupled: the cost at any given stage depends only 
the solution vector at that stage. The n^xn^ matrix Hk is the cost Hessian for stage k and the n^xl 
vector git is the cost gradient for stage k when Xk = 0. 

Two types of linear constraints enter the problem, constraints that link neighboring stages, 
Eq. (lc), and constraints that involve only the solution vector of a given stage, Eq. (Id). The 
former constraints usually arise from dynamic models, and they provide the only coupling between 
the stages. The matrices Ek and Fk+i are the Jacobian matrices of the dynamic constraints. The 
Dk matrices are the Jacobians of the single-stage constraints. The fixed vectors dk, ek, 
and ffc+i arc the nonhomogeneous constraint terms. The vectors ek and fk+i could be added to 
form a single nonhomogeneous term for each dynamic constraint, but the more general form in Eq. 
(lc) has been chosen to facilitate bookkeeping for the parallel domain decomposition algorithm that 
is developed in this paper. The number of constraint equations may vary from stage to stage, but 
the total number of constraints will normally be less than the dimension of X (the total number of 

unknowns). 

The cost function in Eq. (lb) can be aggregated into two large terms, one quadratic and one 


linear: 



Ho 

Hi 

H 2 


0 


0 Hn.i 

H n 


X + [gO T >gl T >g2 T ,--,gN-l T ,gN T ] X 


( 2 ) 


where cost de-coupling is reflected in the block-diagonal sparsity structure of the Hessian. 
Similarly, the constraints in Eqs. (lc) and (Id) can be formed into one large vector of equalities: 
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i 



0 


Do 

Eo Fi 

Di 

Ei F2 



0 En-i Fn 

Dn 


eN-i+^N 

dw 


( 3 ) 


The sparsity structure of this matrix is called a staircase structure for obvious reasons. Much 
research has already been done on how best to exploit this structure, especially for the linear 
programming case (e.g., Refs. 3 and 4). 

Wright (Ref. 5) and Psiaki and Park (Ref. 6) have both developed parallel algorithms for 
solving time-varying Linear Quadratic Regulator-type control problems. They are applicable to a 
special form of problem (la)-(ld) where F^ = [I, 0] for k = 1, ..., N and where the solution 
vector consists of a state and control vector, x^ = [xk T ,Uk T ] T . When run on a distributed memory 
parallel processor, these algorithms offer significant savings in wall clock time in comparison to 
the fastest known serial algorithms. Wright's algorithms, although faster than the algorithm of 

Psiaki and Park, have the further restriction that they cannot handle single-stage state constraints, 
they are restricted to single-stage constraints of the form D* = [0, D U)c ]. Also, none of Wrights 

algorithms can detect an indefinite projected Hessian, and they may yield a cost ascent direction 
when the Hessian is indefinite. 

Unfortunately, the previously-developed algorithms are unsuitable for use with the core NP 
algorithm in Ref. 2. Wright's algorithms' inability to deal with general D* matrices and their 
inability to guarantee descent make them unsuitable. Psiaki and Park's algorithm does not have 
good enough numerical stability because parts of it use the variable-reduction null-space method. 
Therefore, a new parallel algorithm is needed. 

The parallel algorithm developed in this paper is based on the domain decomposition 
approach developed by Psiaki and Park (Ref. 6). It is designed to run on a distributed-memory, 
message-passing parallel computer. The new algorithm is very similar in spirit to the algorithm of 
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Ref. 6. The main difference is in the exclusive use of the orthogonal factorization version of the 
null-space method. This provides the requisite numerical stability, but makes the algorithm use 
more memory. Additionally, this paper defines a method of determining directions of negative 
curvature when the projected Hessian is indefinite. 

The remainder of this paper is divided into two sections plus conclusions. Section 2 
develops the new DQP algorithm. It starts by developing an algorithm for the case where the 
number of processors equals the number of stages, N. Later, it generalizes to the case of fewer 
processors than problem stages. Section 3 presents a test problem and computational timing 
results using an INTEL iPSC/2 32-node hypercube. These results demonstrate the scaling of the 
wall clock execution time with the problem size. 


2. Algorithm Design 

2.1 Review of Orthogonal Factorization Null-Space Method. The algorithm to 
solve problem (la)-(ld) is a special parallel form of the orthogonal factorization null-space method 
described in Ref. 7. When solving dense problems on a serial processor, this method begins by 

performing a right (or LQ) orthogonal factorization of the large constraint matrix in Eq. (3): 

f D ° 1 

Eo F t 0 


Di 

Ei F2 
0 


Q 


En-i Fn 


Dn J 


[L,0] 


(4) 


where Q is an orthogonal matrix and L is a square lower-triangular matrix. This is done using 
Householder transformations, and the form in Eq. (4) can always be achieved when the dimension 
of X exceeds the number of problem constraints. The Q matrix transforms X into two 


components: 

Z 1 rp 

= q t x 

*2 


(5) 
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where Z 2 is in the null space of the constraints. 

After transformation, the original problem takes the form 


find: 

to minimize: 


subject to: 


Z] and Z 2 



H U Jin 

§€12^ 5^22 


Lzi + 


do 

eo+fj 

di 


= 0 


eN-i+fN 

- dN - 


Zi 

+ [9i T 92 T ] 

Zi 

_ z 2. 


_ z 2_ 


(6a) 

(6b) 


(6c) 


where the transformed Hessian and gradient are 

Ho 


Hn H 12 

«,2 T 


= Q 


Hi 


0 


0 Hn-i 


H n J 


Q 


(7a) 


91 

92 


= Q 


go 
gl 

gN-1 

L- gN -J 


(7b) 


If the constraints are non-degenerate, then the transformed problem in Eqs (6a)-(6c) is easy 
to solve. L is nonsingular in this case, and Z] is determined uniquely by the constraints: 

r d o 


Zi = - L 


-1 


eo+fi 

di 

eN-i+fN 

L dN — I 


( 8 ) 
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The null-space coordinate Z2, which does not enter the constraints, can be optimized by setting 
dJ/dXj = 0 . Solution of this expression for Z2 yields 

Z 2 = -H22' 1 [Hi 2 T Zi +92] ( 9 ) 

where H22 must be positive definite for the original problem to have a unique local minimum. In 
this case, §€22 1 can be computed using the Cholesky factorization of JC 2 2- The final problem 
solution X can be computed from Zj and Z2 by inverting the orthogonal transformation in Eq. 

( 5 ). 

The technique just described does not account for the sparsity or special structure of the 
problem matrices in Eqs. ( 2 ) and ( 3 ). A straight-forward application of this technique is likely to 
result in the generation of dense matrices Q, L, ft\\, ft 12, and f€ 22 - This would lead to 0 (N 2 ) 
storage requirements, and the orthogonal and Cholesky factorizations would require O(N^) 
operations. As N, the number of problem stages, becomes large, a standard technique becomes 
inefficient or even infeasible. 

2.2 Overview of New Algorithm. The technique of this paper uses the domain 
decomposition principle to develop a sparse, parallel orthogonal factorization technique for solving 
problem (la)-(ld). It's memory requirements and operation count are both O(N) for an N-stage 
problem. When run on p (<N) processors with distributed memory and message passing 
capabilities, the wall clock time to execute the algorithm is 0 [(N/p) + log 2 (p)]. 

The general idea of the algorithm is to perform the steps of the standard LQ null-space 
method in a piecemeal, stage-wise manner, where possible, to avoid creating fill in the resulting 
matrix factors. Different processors deal with different stages at the same time. First, it LQ 
factorizes all of the single-stage constraints, Eq (Id) for k = 0 ,...,N. This can be done in parallel 
and results in a transformation of the x k at each stage. Part of the transformed x k is completely 

determined by the constraints, and the single-stage constraints along with some of the unknowns 
are eliminated from the problem. At this point the only remaining constraints are those that link 
neighboring stages. These can be LQ factorized at each stage to further transform the remaining 
unknowns. Some of the transformed unknowns do not enter any constraints. All of the preceding 



operations can be done without disturbing the block-diagonal structure of the large Hessian matrix 
in Eq. (2). 

The next operation optimizes the unconstrained variables. The partial derivatives of the cost 
with respect to these variables are taken. These partial derivatives are set equal to zero, and the 
resulting equations are solved for the unconstrained variables in terms of the remaining constrained 
variables. This solution is substituted back into the cost expression to eliminate the unconstrained 
variables from the problem. This can be done at each stage in parallel. At the end of this operation 
the dynamic QP has been transformed into one that has the same number of stages as the original 
QP, but it has no single-stage constraints, and it has fewer unknowns at each stage. 

One final operation is performed; then, the above steps are repeated. The final operation is to 
reduce the number of stages by a factor of 2 by combining adjacent stages into a single larger 
stage. This operation increases the number of variables per stage and makes some of the remaining 
constraints into single-stage constraints. Message passing must occur between pairs of processors 
at this point. The resulting problem is in the form of Eqs. (la)-(ld), and the algorithm can be re- 
applied in a nested fashion. The resulting flow of data between processors in the nested 
application of the algorithm is a fan-in along a binary tree, with a processor at each node (Ref. 6). 

As in Ref. 6, the problem eventually boils down to a single-stage problem, which gets 
completely solved. The solution ot the single-stage problem then gets back-substituted into a 
double stage problem and so on in a stage-doubling operation. This back substitution process 
completes the determination of the variables that have been eliminated during parallel, stage-wise 
partial optimization. This process includes message passing in a fan-out along a binary tree of 
processors. 

2.3 Algorithm for Solving N-Stage Problem on N Processors. The following 
algorithm solves problem (la)-(ld), and it also determines the Lagrange multipliers for the 
constraints in Eqs. (lc) and (Id). The algorithm assumes that the constraints are not degenerate 
and that the projected Hessian is positive definite. Section 2.4 describes modifications that allow 
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determination of feasible directions of negative curvature when the projected Hessian is indefinite. 
The necessary modifications to handle degenerate constraints are discussed in Section 2.5. 


Steps and Detailed Operations for One Stage Halving 


Step 1 . LQ factorize the D k constraint matrices (in parallel): 

D k Qi k = [L lk 0] for k = 0,...,N (10) 

where Qi k is orthogonal and Lj k is square, lower triangular, and 

nonsingular. 


This effectively transforms the x k vectors, splitting them into components that are in the 
single-stage constraints, call these Xj k , and components that are not, call these X 2 k : 

'xik 1 


L*2kJ 


= Qi T k ** 


for k = 0,...,N 


(ID 


This factorization necessitates a transformation of other problem matrices and vectors consistent 
with the stage-wise variable transformations: 


[F lk F 2k ] = F k Q, k 


for k = 1,...,N 

(12a) 

[Ei k ®*2 k ] s E k Qi k 


for k = 0,....N-1 

(12b) 

rHi, k Hi2 k i T 

Si k~| 

UJ Qu8k 

for k = 0 N T 

(12c) 


The 1 and 2 subscripts on the F, E, and H matrices and on the g vectors refer to the index of the 
transformed x vector that gets multiplied by these matrices and vectors in the cost and in the 
constraint equations. In large sparse matrix form, the transformed constraint Jacobian matrix 
becomes 
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Elo e 2 0 Fij F 2l 


Ei i E 2 j Ei 2 F 22 


ElN-l E 2N-1 F 1N F 2i 


which still has staircase structure. The transformed Hessian becomes 

"H ll0 Hi2 0 

HiI 0 H 22o 

H ll 1 H l 2l 

Hil, H 22i 


Hun H i2 N 
Hi2n H 22n 


which still has the block-diagonal structure. 

Step 2 . Use the single-stage constraints to solve for the constrained Xi k and 
eliminate the single-stage constraints and the Xi k from the problem (in 


parallel): 

Xi 1, - - L. 1 1, di 


for k = 0 N 


After elimination of the Xi k and the single-stage constraints, the problem takes the form 


X 2 q, X-2 X 2 •••* ^2 j 


'minimize: J = ^ { V x 2 T k H 22 k *- 2 k + g7 k x 2ll 


(16a) 


(16b) 


subjectto: E 2k x 2k + e k + F 2k+1 x 2k+1 + f k +i =0 fork = 0...N-l (16c) 

which is similar in form to the problem in Eqs. (la)-(ld) only without any single-stage constraints. 
The new problem is derived by substitution of Eqs. (10), (1 1), (12a)-(12c), and (15) into problem 

(la)-(ld). The vectors with the (~) overstrike are 

§ 2 k = g 2 k • H,I k Lj' k d k for k = 0.....N (17a) 


(17a) 
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®k — ®k ■ Ei k L\ k dk, f k+i — fk+i ■ Ei k+iL] | c+ jdi c+ i fork 0,...,N-1 (17b) 

Step 2 eliminates all of the rows and columns that contain an Li k matrix from the constraints 
in Eq. (13). Also, it eliminates all of the rows and columns that contain an Hn k matrix from the 
Hessian in Eq. (14). In large sparse matrix form, the new constraint Jacobian and Hessian for the 
problem in Eqs. (16a)-(16c) are, respectively, 

r E20 F 2l 0 

E21 F22 

0 E 2 N .i E 2 N 





Step 3 . LQ factorize 



to eliminate variables from the remaining constraints by 


forming zero columns in the transformed Jacobian matrix: 


E 2 qQ 2 o = [ L4 ° ® ] 



(19a) 

r F 2 k ] n fL 3k 0 0-1 
LE2j Q2k= LE3 k L 4k oJ 

for k = 1 , 

...,N-1 

(19b) 

F 2 N Q 2 N =[ L JN»] 



(19c) 


where the Q? are orthogonal and the L 3 and L 4 are square and lower 

^ k k k 


triangular (in parallel). 


This step can be done only if the matrix to be factorized has more columns than rows. For 
general dynamic QPs this may not be true initially, but this factorization will become possible after 
several halvings of the number of stages. 

Step 3 effectively transforms each X 2 k vector and splits it into three components. One 
component enters two constraints, the constraint that couples the current stage (k) to the preceding 
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stage (k-1) and the constraint that couples the current stage (k) to the following stage (k+1); call 
this component % 3k . It is a "state "-type component in the terminology of control theory. Another 

component enters only the constraint that couples the current stage (k) to the following stage (k+1); 
call this component X 4 k . It is a "control "-type component. The third component, Xs k , enters 


none of the constraints. One might call it an "ineffective control". The transformations are 
’* 4 0 


L*5oJ 

X 3k ' 

X4 k 

l_*5 k J 

x 3n 

X 5n_ 


= Q2o x 2o 


= Q 2 T k x 2k 


- Q2nX 2n 


for k = 1,...,N-1 


(20a) 


(20b) 


(20c) 


Again, this step necessitates a transformation of other problem matrices and vectors to account for 


= Q2 T 0 H22 0 Q2o- 


the transformed x variables: 

H 44 Q H 45 Q 
_ H4JQ H55Q _ 

H 33k H 34k H 3 5 k 


84 0 

85 0. 


= Q 2 T o 82 o 


H 3 4 k H44 k H45 k 

LH 3 I k H 4 t k H„J 

h^ m h 


= Qj k H2 2lc Q 2k . 


83 k 
84 k 
L85 k J 


= Q 2 T k 82 k for k = 1 N-l 


33 N n 3 5 N 


,h 3 I n h 


55 N J 


= Q2 1 nH22 N Q2 N ’ 


'83 n 
85n 


- Q 2 n 82 n 


(21a) 

(21b) 


(21c) 


The 3, 4, and 5 subscripts on the H matrices and on the g vectors refer to the index of the 
transformed x vector that gets multiplied by these matrices and vectors in the cost. After the step-3 
orthogonal transformations, the large sparse form of the Jacobian becomes more sparse, but it 


retains its staircase form: 
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“ L.4q 0 L3 j 0 0 

E 3 J L4| 0 L 32 0 0 

E 3 , L 49 0 


0 


L3 N .! 0 0 

®*3n.] 0 L 3 N 0 


(22) 


and the transformed Hessian retains its block-diagonal form 
r H44 0 H 450 

H 4 I 0 ^55o h h h 

H331 “34 j “35 1 
H 3 X 1 H44j H 45 j 

h 3 I 1 h 4 Jj h 55i 


0 


H33n-, ^34n_j H35n_i 

J 

H 34 N 4 H44 N .j H45 N1 

J J 

H35n-1 ^45n_i H55n_i jj H 


35n 


h^ m h 


35 N «55 N _l 


(23) 


Step 4 . Partially optimize by computing the optimal Xs k as a function of % 3 k and 
X 4 k (in parallel for all stages where step 3 has been carried out). Solve the 

equation 

H55 k *5 k = -[ H 35 k *3 k + H 4 J k X 4k + g5 k ] ( 24 ) 

then eliminate Xs k from the cost function by substituting this solution for it. 

This equation can be solved using a Cholesky factorization, Hs 5 k = L 5 k L 5 k , and back 
substitution; Ls k is square and lower triangular. The Hs 5 k matrix constitutes a block on the 
diagonal of the projection of the Hessian of the original problem onto the null space of the 
constraints. It must be positive definite for the original QP to have a finite, unique solution. Given 
the assumption of a positive definite projected Hessian for the overall problem, the Cholesky factor 
L 5k exists and is nonsingular, and Eq. (24) has a unique solution. (Section 2.4 explains useful 


I 
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, -a in the case of an indefinite or semi-definite H 5 5 k -) Upon 

calculations that can be performed in the case 

substitution of the solution X 5k in® the cost, the problem becomes ^ 

find: 

to minimize: 

N-l r _ - “I r.. 1 

> 


X„ 0 , X},, X*,, X3 2 , X, 2 *«N-P and 

j = 1 xJ 0 H44o X 4 0 + 840^40 


subject to: 


N-l 

k=l 


r t „t 1 

H 3 3 k H 34 k | 

~ X 3 k _ 

1 + [g 3 T k>g 4 T k ] l 

‘ X 3 k ~ 

[ X 3k *4 k J 

H 34 k H 44 k J 

_ X 4 k _ 


L X 4k. 


+ i-X3 T N H 3 3 N X 3N + g3 T N X 3N 

L4 0 ’ C 4 0 + eo + L3iX3 1 +fl - 0 

E 3k X3 k + L4 k X4 k + ek + U3 k+ l x 3 k+ i + ^ +l “ 0 


(25b) 

(25c) 

for k = 1 ••• N-l 
(25d) 


H 33k = H 33k - H 3 5 k L 5 k H 3 T k 

H 3 4 k = H 3 4 k - H 3 5 k L5lU5 l k H4l k 

H 44k = H 4 4 k - H 4 5 k U5l L 5 1 k H 45 k 

T -1 

g3 k = 83 k - H 3 5 k L 5 k L 5 k g5 k 
g4 k = g4 k * H45 k L 5lU5 k g5 k 


solution of Eq. (24) into the resulting 

e are 


for k = 1 N 

(26a) 

for k = 1,—sN-l 

(26b) 

for k = 0 N-l 

(26c) 

for k = 

(26d) 

for k = 0,...,N-1 

(26e) 

(22) all those columns 

that contain only 


4 eliminates irom urc w..— •» — 

— . Also, i. eliminates all of the corresponding tows and columns from the He"- 

Eq (23), which are the rows and columns that contatn an H 55k matn, 

4 ‘ The new large sparse forms of 

remaining nonzero blocks of the large spane Hessian g (25a) . (25d) are. respectively, 

a r*r\c\ Hessian for the problem in Eqs. (2ba) ( ) 

the constraint Jacobian and cost Hessian tor v 
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0 


“ L*4q L*3j 

E3 1 L» 4 j L32 

^32 ^^2 

0 ^ 3n 1 

E3N-1 ^4N-1 ^3 n J 

" H 44o 0 

H 33 j H 34 , 

H34J H44J 


0 


H33N.1 H 34 N .] 

H34n-1 H 44 N 4 

H33 N _ 


(27a) 


(27b) 


The Jacobian is still a staircase matrix, and the Hessian is still a block diagonal matrix. Steps 
1 through 4 do not involve any calculations which require data to be passed between stages, nor do 
they create any new coupling in the problem matrices. 


Step 5 . Halve the number of stages by joining each even-numbered stage, stages 
0,2,4,...,N-1, to the odd numbered stage that follows it. Assume that N is 
odd (N = 2M+1). This is done in parallel and includes parallel message 
passing between M+l pairs of processors. 


This aggregation of adjoining stages amounts to a re-definition of the boundaries between 
stages in the large sparse problem matrices. Stages 2k and 2k+l get joined together into a 
"hyperstage". The dashed vertical lines in the following large sparse Jacobian show how step 5 
super-imposes on the Jacobian of Eq. (27a) boundaries between a new "hyperstage" and its 
neighboring "hyperstages". 
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( 28 ) 


_ 1<4q L 3 J 

1 


l 

t 


0 

e 3i 

1 


1 

1 



^2k-l, L 32k 


1 

{ 

I 




I 

( E 32k E< *2k 

^32k+l 

1 

l 




1 

1 

^3 2k-*- 1 

^ 4 2k+l.^ 3 2k+2 



0 

1 

t 


1 

I 

l 

^3n-1 


1 


i 

E 3N-1 

E 4 j>j_] L3 N _ 


Similar boundaries between new "hyperstages" can be drawn to split up the Hessian matrix of Eq. 
(27b): 

H44 0 ( ! 


H4421C-1 


0 


0 


H33 21c H342k 
H?4 2 k H 44 


2k 


^332k+l ^342k+l 

'Y 

^342k+l H442k+1 


(29) 


I H 332 k +2 


H 3 3 N -I 


Step 5 is the concluding step in the process of halving the number of problem stages. After 
this step the problem is back in the form of Eqs. (la)-(ld) only with half as many stages, M+l 

stages instead of N+ 1 : 
find: 


AAA A 

Xo, xi, X 2 , .... and x M 


M 


to minimize: J 


= ^ ( jXkHkXk + gk*k) 


(30a) 

(30b) 


k=0 
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subject to: Ek *k + e k + E k +i **+1 + fk+i =0 fork-0... M-l 


(30c) 


Dk x k + d k = 0 


for k = 0 ... M 


(30d) 


The vectors and matrices with the ( A ) overstrike are derived by grouping of neighboring pairs of 


stages: 


Xo= x 3l I, x k = 


X 32k 

X4 2k 

X 32k+l 

x 4 2 k+i 


for k = Xm = *4 


F k = [ L 3 2k ’ 0 ’ 0 ’ 0 ] fork = F m = [L 32M , °’ °] 

Do = [L>4 o> L 3i , 0], Dk= [E 32k , L4 2k » L 32k+1 , 0J for k = 

A 

Dm s [E3 2 M> ^ 4 2M’ ^3 2 m+i] 

E 0 s[0,E 3i ,L 4i ], E k s[0,0,E3 2k+1 ,L4 2 k +1 ] for k = 1,...,M-1 


d k = e 2 k +f 2 k+i for k = 0,...,M, 


e k = e 2 k+i for k = 0,...,M-1, 


fk = f 2 k for k = 


^ H 44 Q 0 0 

H 0 = 0 H 33 j H 34 1 

_ 0 H34] H 44 j 


Ht = 


33 2k H 34 2k 


Hl 42k H 4 4 2k 0 0 

0 0 H33 2k+1 H 3 4 2k+1 


Hm= H 


0 OH 

^33 2 m *^34 2 m 
T 

134 2 m ^44 2 m 

0 0 1 


342k+l M 44 2k +1 


for k = 


0 H 33 2 m+i 


g0 = g3 1 • 8k = 


I 83 2k+ i I 
^ S4 2k-t-l 


for k = 1 1 , gM = g 4 


83 2M 


83 2 M+] 


(31b) 


(3 Id) 


(31g) 


(3 lh) 
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As in the algorithm of Ref. 6, this step is the only step in the entire stage-number halving 
process that involves message passing. The quantities that get computed in Eqs. (31b)-(31i) must 
all reside on a single processor at the end of this step, one processor for each index k. At the 
beginning of this step, however, the information on the right-hand side of the equations associated 
with index 2k resides on one processor, while the information associated with index 2k+l resides 
on another processor. A binary tree structure of inter- processor connections ensures that all such 
pairs of processors will have a direct connection, and the M+l messages can get passed in parallel 
(Ref. 6). A binary tree can be imbedded in any hypercube processor. 

Nesting . The foregoing 5-step process can be repeated until the entire problem is solved: the 
halved problem is again halved in a nested cycle. Thus, the hatted vectors and matrices at the end 
of a cycle become the un-hatted vectors and matrices at the beginning of the next cycle -- note the 
exact replication of the problem (la)-(ld) form in problem (30a)-(30d). Suppose that, initially, N 
= 2>-l. That is, suppose the problem starts with 2 ) stages. After j times through the halving cycle, 
the problem will have only one stage, stage 0. Data will have fanned in to a single processor along 
a binary tree with j levels of branches (Ref. 6). 

The final single-stage problem can be solved by a similar technique. Steps 1, 2, and 4 of the 

above cycle can be applied. There is no need for step 3 because there cannot be any multi-stage 
constraints. The vectors X 2 0 and % 5 0 are equivalent, and the vectors X 3 0 and x 4q have zero 
dimension. The vector X 5 0 can be determined from Eq. (24), and the vector Xi 0 can be 

determined from Eq. (15). Finally, the vector Xo can be determined from Eq. (11). 

After the first stage-halving cycle, further computational efficiency can be realized if attention 
is paid to the structure of Fic in Eq. (31b) and of in Eq. (31c). They both contain zeros, and 
these zeros can translate into savings during the LQ factorizations ot steps 1 and 3 in the above 
cycle. The timing results reported in Section 3 are for an algorithm that includes this time-saving 
feature. 

Back Substitution and Stage Doubling . Back substitution is used to reconstruct the entire 
solution history after the nested problem-halving sequence has yielded a problem with a single 
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stage and after that single-stage problem has been solved. The back substitution process is the 
reverse of the problem- halving process, it is a stage-number doubling process. Given a certain 
level of problem halving, and given the Xj^ and X4 k time histories corresponding to that level, the 

first step is to determine the x k time history for that level — recall that x k denotes the solution at 
stage k prior to the two orthogonal transformations and partitionings that occur in steps 1 and 3. 
The vector x* can be determined in parallel by successive application of Eq. (24) [to determine 
*5 k ], Eq. (20) [to determine x 2k ], and Eq. (11) [to determine x k - Xi k has already been 

determined during the stage-halving process via Eq. (15)]. Knowledge of the x k time history for 

the current level of problem halving translates into knowledge of the 
X 3 2k , X 4 2k , *32k+i> and X4 2k+i time histories f° r the next previous level - the level with twice 

as many stages -- [see Eq. (31a)]. The stage-doubling algorithm can then repeat itself. 

Message passing occurs during this back substitution procedure. The pattern of the 
information flow is a fan-out along a binary tree. After each x k has been calculated at a given level 
— each stage on a different processor, it is broken into it two pairs, (X3 2k ,X4 2k ) and 
(X3 2 k + 1 ,x 4 2 k+ 1 ) • O ne of the P airs is sent to a neighboring processor. The other pair remains on 
the current processor, which will perform computations for the corresponding stage at the next 
level of doubling. 

Multiplier Computation . One often needs to determine Lagrange multipliers for the original 
problem constraints in Eqs. (lc) and (Id). The nonlinear programming algorithm of Ref. 2 
sometimes requires these multipliers. Alternatively, an active-set algorithm for solving inequality- 
constrained QPs needs to determine multipliers in order to decide which constraints to drop from 
the active set 

The multipliers can be calculated during the same stage-doubling process that calculates the 
x k vectors. Suppose that the multiplier vectors associated with Eq. (lc) are f° r k = 0,...,N-1 
and that the multipliers associated with Eq. (Id) are 0k for k = 0,...,N. Given all of the £& vectors 
at a given level of stage doubling, the following process determines the 0k vectors. First, adjoin 
the constraints to the cost in Eqs. (la)-(ld) using the and 0k multipliers. Second, perform the 



Q lk transformations defined in Eqs. (1 l)-(12c). Third, set the partial derivative with respect to 
%i k of the resulting expression equal to zero. Last, solve the resultant equation for by back 
substitution with the matrix. 

Knowledge of the flk and & time histories at a given level of stage doubling translates into 
knowledge of the flic time history at the next higher level of stage doubling. Thus, the process can 
be repeated. The single-stage problem that terminates the stage-halving process has no constraints 
like Eq. (Id); therefore, it has no flk vector that needs to be known ahead of time. This fact 
allows the multiplier computation algorithm to initialize at the beginning of the stage-doubling 
process. 

2.4 Procedures for an Indefinite or Positive Semi-Definite Projected 
Hessian. Section 2.3's equality-constrained QP solution procedure breaks down when the 
projected Hessian is indefinite or positive semi-definite. Step 4 fails for some stage number and 
level of problem halving. This situation is signaled by a break-down in the Cholesky factorization 
process. Either a negative square root or a divide-by-zero occurs, and the real Cholesky factor L. 5 k 

cannot be computed. In the indefinite case, the equality-constrained QP cost function has an 
infinite minimum at infinity. In the positive semi-definite case, the QP has a non-unique minimum. 

Two useful pieces of information can be derived in the indefinite and semi-definite cases if 
this algorithm is used as part of an SQP-type NP algorithm or as part ot an active-set inequality- 
constrained QP algorithm. First, the parent algorithm usually will want to know whether the 
equality-constrained QP's projected Hessian is positive definite, positive semi-definite, or 
indefinite 4 . Second, the parent algorithm may want to know a feasible direction of negative (zero) 
curvature in the indefinite (positive semi-definite) case. 

The method of determining these things is based on the modified Cholesky factorization 
process presented on pp. 108-110 of Ref. 7. When Cholesky factorizing a matrix that is not 
sufficiently positive definite, this procedure adds positive values to some of the matrix's diagonal 

4 Negative definite and negative semi-definite are considered to be synonymous with indefinite for 
purposes of this paper. 
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elements during factorization. This modification ensures that the matrix is sufficiently positive 
definite. For stage i, whose matrix H55 j is not sufficiently positive definite, the modified process 

would compute Cholesky factors of H55 ; + Ej: 

L5iL?i = H 55i + £t (32) 

where Ei is a positive semi-definite diagonal matrix that gets generated during the modified 
Cholesky factorization process. Nonzero diagonal elements of Ej get generated if a diagonal 
element of Lsi would otherwise be imaginary, zero, or too small or if any off-diagonal elements of 
L«5 j would be too large. All of these situations correspond to an Hssj matrix that is either 

indefinite, positive semi-definite, or positive definite but poorly conditioned. The latter two cases 
are effectively equivalent. 

The stage-halving and stage-doubling processes of Section 2.3 can still be earned to 
completion by using the modified L5 ; Cholesky factor where necessary in Eqs. (26a)-(26e) and by 

replacing Hssj in the corresponding Eq. (24) with Hssj+Ei. If the constraints are homogeneous, 

then this technique is guaranteed to produce a feasible descent direction for the original problem. 

If the Ei matrix has any nonzero elements, then some of these elements correspond to 

directions of zero or negative curvature. When the modified Cholesky factorization process 
produces a nonzero diagonal element of Ej to avert an imaginary diagonal element of L5 ; , this 

corresponds to a direction of negative curvature. A nonzero diagonal element ol Ej that has been 
produced to avert a zero (or small) diagonal element of corresponds to a direction of zero (or 

low) curvature. If, on the other hand, a nonzero diagonal element of E, gets produced to keep the 
off-diagonal terms of Lsj from becoming too large, then another nonzero diagonal element of of 

Ei will get produced later in the process. This latter element will correspond to a direction of 
negative or zero curvature. 

The modified Cholesky factor L5 j can be used to compute an actual direction of negative, 

zero, or low curvature. Suppose that the jth diagonal element of Ej is nonzero and that this 
element was produced to avoid an imaginary diagonal element of L5 ; . The calculation of a 

negative curvature direction begins with solution of the equation 


A64 



( 33 ) 


L J. Pj = cj 

for pj, where ej is a unit vector with a 1 in the jth row and zeros elsewhere. By referring to Ref. 

7, it is easy to prove that pj^HssjPj/pj^Pj < 0. The vector pj is a direction of negative curvature 
for the X 5 j vector. Similarly, if the jth diagonal element of £i had been produced to avoid a small 
or zero diagonal element of Lsj, then pjTHss ;Pj/Pj T Pj would be small or zero. 

Additional calculations are needed in order to determine the corresponding direction of 
negative curvature for the original problem in Eqs. (la)-(ld). These calculations are similar to the 

stage-doubling back substitution process defined in the Section 2.3. The process starts by setting 
X 5 j = Pj and X 5 k = 0 for all k * i at the current level of stage halving. It also sets all of the Xi k , 
X 3 k , and X 4 k vectors to zero at this level of stage halving. The feasible direction of negative 
curvature is calculated under the assumption that all of the constraints are homogeneous (i.e., dk = 
efc = fk = 0 for all k); it is a feasible direction in the null space of the constraints. 

Next, the process determines the negative-curvature Xk time history for the current level of 

problem halving. This time history can be determined in parallel by application of Eq. (20) [to 
determine theX 2 k ] followed by application of Eq. (11) [to determine the Xk]. Knowledge of the 

x k time history for the current level translates into knowledge of the X 3 2k > X4 2k* 
X 3 2k+1 , 311(1 * 4 2k+i time histories for the next previous level, and the stage-doubling back- 

substitution algorithm of Section 2.3 can be applied. During the ensuing stage-doubling cycles, 
the homogeneous constraints assumption is maintained, i.e. Xi k = dk = 0. 

When the projected Hessian matrix of the original problem in Eqs. (la)-(ld) is sufficiently 
positive definite, then all of the £* will be identically zero for all of the levels of stage halving. If 
any of the £k matrices have a nonzero diagonal element, then the projected Hessian is either 
indefinite, positive semi-definite, or positive definite but almost positive semi-definite. In the latter 
case, the projected Hessian is treated as being positive semi-definite to within the precision of the 
calculation. 

The indefinite case can be distinguished from the positive semi-definite case. If not one of 
the nonzero elements of the £* matrices was needed to avert an imaginary element of the 
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corresponding Ls k matrix, then the projected Hessian of the entire original problem is positive 
semi-definite. Otherwise, it is indefinite. 

In the indefinite case, several directions of negative curvature may be calculable by the Eq.- 
(33) technique. These may correspond to different stages and to different levels of problem 
halving. A parent algorithm that uses this QP algorithm may want to calculate all of these 
directions in order to do something like pick the one with the most negative curvature as a search 
direction. Note that none of these directions is guaranteed to be the direction of most negative 
curvature for the projected Hessian. This should not be a problem for most parent algorithms, so 
long as at least one direction of negative curvature can be calculated. 

The use of a parallel algorithm can compromise numerical stability when calculating 
directions of negative curvature. The modified Cholesky algorithm of Ref. 7 enforces a maximum 
limit on off-diagonal elements of the Cholesky factors. This maximum limits the size of the £k 
matrices and ensures numerical stability when the Hessian is indefinite. 

The parallel algorithm uses the modified Cholesky procedure to ensure that the off-diagonal 
elements of all of the Ls k matrices remain small, but elements of matrices such as L, 5 k H 3 j k and 
l4Hj k , which are used in Eqs. (26a)-(26e), can grow without bound. Numerical experiments 

indicate that this growth causes no problem when the original problem's projected Hessian is 
positive definite, but it can lead to round-off error problems in the indefinite case. The ultimate 
result of this instability is that calculated directions of negative curvature may not have as much 
negative curvature as they would have had if growth in matrices such as L. 5 k H 3 j k and 

could have been limited in a sensible way. Unfortunately, there seems to be no obvious way to 
limit this growth without incurring adverse side effects in the positive-definite case. 

2.5 Modifications to Deal with Degenerate Constraints. If the constraints in Eqs. 
(lc) and (Id) are degenerate, then the algorithm of Section 2.3 breaks down at step 2 for some 
level of stage halving. One of the Lj k matrices will be singular. 

When the constraints are degenerate, they may or may not admit a feasible solution. If they 
do not admit a feasible solution, then a sensible approach is to find the optimal Xk time history that 



also minimizes the mean square error in Eqs. (lc) and (Id). Whether or not the residual mean 
square error is zero, the approach to solving this modified problem is the same. It is a variation to 
the method in Section 2.3. The algorithm must be modified at steps 1 and 2 in order to handle 
degenerate constraints. 

In the modified step 1, the LQ factorization of the single-stage constraint Jacobian gets 


replaced by a complete LQ factorization; equation 


Q3 k D k Qi k = 


'Ll* O' 

o o. 


(10) gets replaced by 
for k = 0,...,N 


(34) 


where Qi k and Q 3 k are orthogonal and L i k is square, lower triangular, and nonsingular. Any 
matrix D k can always be factored in this manner. The new orthogonal matrix Q3 k transforms the 


single-stage constraints, splitting them into two types; 


1 

0 

1 

'x lk 


-<v 

= Q 3k d k 

for k = 0,...,N 

(35) 

L o oj 

_x 2k _ 


_ d 2k_ 





where Xi k and x 2k have been defined in Eq. (11). The first row ot the transformed constraints in 


Eq. (35) is non-degenerate. The second row in Eq. (35) is a set of degenerate constraints. In the 
degenerate case, Eqs. (lc) and (Id) of the original problem can be exactly satisfied if and only if 
d 2k = 0 for all k and for all levels of stage halving. 

In the modified step 2, the solution for x, k from the single-stage constraints gets replaced by 


a least-square solution of the single-stage constraints. In other words, Eq. (15) gets replaced by 


*ii 


= -Li' k di k 


for k = 0 N 


(36) 


Except for multiplier determination, all of the other parts of the algorithm remain the same, 
including the stage-doubling back-substitution process. The multiplier vector is under-determined 
when the constraints are degenerate. A sensible thing to do is to determine the minimum-norm 
multiplier vector. Suppose, as in Section 2.3, that Hk for k = 0, ..., N are the multipliers 
associated with the constraints in Eq. (Id) at a given level of stage halving. These multipliers can 
be transformed and split to correspond to the transformed and split constraints in Eq. (35): 
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= Q3 k & 


for k = 0,...,N 


(37) 


& 


The minimum-norm multipliers can be determined by a procedure similar to the multiplier 
determination procedure of Section 2.3. The cost is augmented by adjoining the transformed 

constraints using the transformed multipliers, and the partial derivative of the augmented stage- 
wise cost with respect to the Xi k vector is set equal to zero. This yields an equation for Hi k , 
which can be solved by back substitution with the matrix. The P? k vector does not affect any 
such partial derivatives. It is set equal to 0. The Jik vector is then determined by inverting Eq. 


(37). 

2.6 Modifications to Handle Multiple Stages per Processor. A modified 
algorithm can solve problems that have more stages than the number of available processors. It is 
similar in spirit to the multi-stage-per-processor algorithm described in Ref. 6. When a single 
processor has more than one stage, the algorithm starts by working only with the last stage of this 
set of stages -- it must have a contiguous set of problem stages. It starts by performing steps 1-4 
of Section 2.3 on this last stage. Next, it performs a modified version of step 5 to join the last 
stage on the processor to the second-to-last stage. Step 5 gets modified to account for the fact that 
the second-to-last stage may have some single-stage constraints. At the end of these steps, each 
processor has one less stage. This sequence of steps is repeated until each processor has only one 
stage. 

The modified algorithm performs a sort of backwards sweep through the stages on a given 
processor. After the sweep reaches the first stage on the processor, the algorithm can continue in 
the stage-halving manner described in Section 2.3. 

This backwards sweep is related to but different from the matrix Riccati backward sweep of 
LQR theory. First, the current backwards sweep operates on a more general problem form. 
Second, it uses LQ factorization rather than variable reduction to eliminate the constraints that join 
stages. Third, this sweep can handle arbitrary single-stage constraints such as "state" constraints. 


A68 



Figure 1, borrowed from Ref. 6, shows how a 24-stage problem would be mapped onto an 
8-processor binary tree (which can be formed on an 8-processor hypercube). The upper graph 
shows the original 24 stages (0 to 23). The lower graph shows how 16 stages remain after one 
stage-joining cycle has been executed on each node. Eight of the remaining stages are original 
problem stages, 0, 3, 6, 9, 12, 15, 18, and 21. Each of the other eight stages is the remains of 
two of the original problem’s stages that have been joined, (1,2), (4,5), (7,8), (10,11), (13,14), 
(16,17), (19,20), and (22,23). 

The back substitution process described in Section 2.3 undergoes a similar modification. It 
starts with the algorithm s original stage-doubling back substitution process. This terminates when 
each processor has received an x k vector corresponding to the first of its set of stages. A forward 
sweep is then made through the remaining stages to determine the remaining x k vectors. 

2.7 Expected Scaling of Wall Clock Time. The expensive operations of Section 
2.3's algorithm are the orthogonal matrix factorizations and matrix multiplications in steps 1 and 3 
anH the Cholesky factorizations and matrix multiplications of step 4. If n is the average number of 
elements of an x k vector, then each of these operations requires 0(n 3 ) flops or less. The actual 

number of operations depends upon the number of constraints in Eqs. (lc) and (Id), and this 
number is limited if the constraints are not degenerate. At later levels ot stage halving, the average 
number of elements of x k may increase, possibly becoming as large as 4n. The expensive steps 

then require 0(64n 3 ) flops or less, which is still 0(n 3 ). 

Step 2 and the back-substitution steps are relatively inexpensive. Both require 0(n 2 ) flops 
because they only involve matrix-vector multiplications and back substitution solutions of 
triangular linear systems. Step 5 is also inexpensive. It involves only message passing between 
processors and movement of data within the local memory of some processors. 

Analysis of the stage-halving procedure of Section 2.3 indicates that its wall-clock time is 
0[n 3 log2(N)] for an N stage problem executing on N processors. The time is 0(n 3 ) per stage- 
halving cycle, and the entire process requires log2(N) stage-halving cycles. 
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When the number of stages, N, is greater than the number of processors, p, then additional 
time is required for the initial backwards sweep that simultaneously occurs on each processor. If 
each processor has N/p stages, then this time is 0[n^(N/p)]. The entire solution procedure, which 
includes the backwards sweep followed by the stage-halving process, requires 0{n^[(N/p) + 
log 2 (p)]} This scales just as the algorithm in Ref. 6, although the constant in front of the scaling 
law differs between the two algorithms. 

The INTEL iPSC/2 is the machine on which this algorithm has been tested. It has a vector 
processor attached to each node. The vector processor time for doing such things as an inner 
product is approximately independent of n when n is small. The presence of a vector processor at 
each node affects the actual scaling of the wall clock time as a function of n when n is small: the 
algorithm’s wall clock time is 0(n 2 [(N/p) + log2(p)]} when solving an N-stage problem on p 
processors in this case. 

3. Performance Evaluation on a Test Problem 
3.1 Aero-maneuvering Test Problem. An aero-maneuvering problem described in 
Refs. 6 and 8 has been used to test the algorithm. It is a linear-quadraticization about a guessed 
solution of a nonlinear trajectory optimization problem. The problem has a 6-element state vector 
and a 3- or 4-element control vector, depending on the problem stage. The linear-quadratic 
problem form from Ref. 6 is 


find: 

Uo, Xi, ui, X 2 , .... UN- 1 , XN> and u N 

j=y (L[xM["?" xuk ' 

Z |_ M XU k H UU|c 


(38a) 

to minimize: 

[;iMs‘Wj[ x ; k ]> 

(38b) 


k=0 



subject to: 

x 0 given 


(38c) 


Xk+i = A k x k + B k u k + c k 

for k = 0 ... N-l 

(38d) 


R k x k + S k u k + t k = 0 

for k = 0 ... N 

(38e) 


where x k is the state vector at stage k and u k is the control vector. 
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Equation (38d) is a linearized dynamic system model, linearized about a guessed trajectory. 
It is the linearization of a model based upon Newtons laws for 3-degree-of-ffeedom translational 
motion of a point mass subject to gravity, and thrust or aero-maneuvering forces. A zero-order- 
hold assumption for the control inputs yields a discrete-time model. The presence of the 
nonhomogeneous Ck vector in the linearized dynamic model indicates that the guessed solution is 
infeasible; it does not obey the dynamic model. This situation is allowed at intermediate solution 
guesses by the NP algorithm of Ref. 2. 

The auxiliary single-stage constraints in Eq. (38e) are present only for some stages. The 
matrices Rk and Sk form the Jacobian for each stage's constraints, and the vector tk is the 
nonhomogeneous term. Constraints of this form are used at intermediate stages to define entry and 
exit of the atmosphere. For k = N, the Eq. (38e) constraints ensure that the desired terminal orbit 
is achieved. At stages where the peak allowable heating rate is exceeded by the guessed solution, 
additional Eq. (38e) constraints are included to enforce the maximum heating rate limit. 

Slack variables have been added to the problem before testing the algorithm. Slack variables 
are not necessary. They have been added in order to put the QP into the form that would be used 
by the NP algorithm of Ref. 2 if that algorithm were being used to solve the nonlinear optimal 
trajectory. In that situation, the QP algorithm of this paper would be calculating an NP search 
direction. Slack variables can be added to each of the constraints to penalize violations. Then the 


QP becomes 
find: 

to minimize: 


uo, yo. zo. x i. U,, y i , z\, X 2 , ..., UN- 1 , yN-1. z N-t. x n. un. and z n ( 39 a) 

N 


J = 


X ( ? [xM [hj‘hZ|] [ul] + [ g7 “- 8 “ T 4 


"Xk 

-Uk 


k=0 


N-l 


+ 




k=0 

N 


X h 2 + V h tk|T(Zk W h tkl 

k=0 


(39b) 
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subject to: xo given 


(39c) 

x k+ i = A k x k + B k u k - y k 

for k = 0 ... N-l 

(39d) 

R k x k + S k u k - = 0 

for k = 0 ... N 

(39e) 


TheQP in Eqs. (39a)-(39e) can be restated in the form of Eqs. (la)-(ld), given the following 


definitions of the solution vector at each stage: 
u 0 


X0 = 


X k H 


Xn 5 


yo 
zo. 

Xk 
U k 
yk 

LZ k J 
XN 
UN 
L Z N 


for k = 


(40a) 


(40b) 


(40c) 


The corresponding constraint Jacobians, constraint non homogeneous vectors, cost Hessians, and 
cost gradients are easy to determine. For example, 

D k = [R k , S k ,0, I 


E k = ^A k , B k , - -yj - I, 


F k = [-I, 0, 0, 0] 
H k = 


r H XXk H XUk 0 0 -1 

H x J k H UUk 0 0 

0 0 hi 0 

L 0 0 0 hi J 


for k = 1,...,N-1 
for k = 1,...,N-1 
for k = 1,...,N-1 

for k = 1,...,N-1 


(41a) 

(41b) 

(41c) 

(4 Id) 


For the sake of brevity, the remaining definitions of the equivalent problem (la)-( Id) matrices and 
vectors have been omitted. 

3.2 Computational Timing Results. Different size versions of this problem have been 
solved ranging from 8 to 128 stages. The storage requirements of the 128 stage problem take up 
more than half of the available 4 Mbytes of memory per node on a 32-node INTEL iPSC/2. 

A serial version of this algorithm has also been tested in order to determine the speed-up due 
to parallelism. The serial version solves the same problem, Eqs. (la)-(ld), but it does it using the 
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best known serial technique: a single backwards sweep, which is similar to the serial algorithm of 
Ref. 6. 

The timing results for these runs are reported on Fig. 2. The horizontal axis gives the 
number of problem stages on a log scale. The vertical axis gives the wall clock time on the INTEL 
iPSC/2. Each curve corresponds to a particular algorithm running on a particular number of 
nodes. As can be seen, all of the parallel algorithms are significantly faster than the serial 
algorithm. Figure 3 displays the speed-up due to parallelism by plotting the ratio of the serial 
algorithm time to the parallel algorithm time for the different cases. 

These graphs clearly display the benefits of parallelism. For the 128-stage problem, the new 
parallel algorithm running on 32 nodes is more than 10 times faster than the serial algorithm (the 
serial algorithm takes 12.4 sec in this case, which puts it well off of the scale of Fig. 2). This 
speed-up increases with increased problem size and vice versa. The efficiency of the parallel 
algorithm is not great, a speed-up of 10 for 32 processors translates into a 32 % efficiency. 

Note that this comparison is between the parallel algorithm and the best known serial 
algorithm of equivalent numerical stability. If the parallel algorithm were run on a single 
processor, it would be slower than the serial algorithm used in this study. Some authors in the 
parallel computation field use the latter type of "serial" algorithm as the benchmark to determine 
their speed-up due to parallelism. In that case, the parallel algorithm's speed-up would be about a 
factor of 16 for a 50% efficiency. This is a measure of the average busy time of each node during 
the execution of the parallel algorithm. This efficiency rises with (N/p), the number of problem 
stages per node. 

Though neither efficiency is impressive, efficiency is not the ultimate goal of this work. The 
ultimate goal is a reduced wall clock time so that real-time applications will be feasible. The 
extrapolated "N-stages-on-N-nodes" line on Fig. 2 shows that a problem with many stages can be 
solved in a reasonably small amount of wall clock time if enough processors are available. 

For comparison purposes, results from the earlier study of Psiaki and Park are included on 
Fig. 2 (Ref. 6). These are the two lowest curves on the graph; they do not apply to the algorithm 
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presented in this paper. These curves are the timing curves for the algorithm of Ref. 6 applied to 
aero-maneuvering problems in the form of Eqs. (38a)-(38e). This is lower-dimensional problem 
because it does not include the slack variables yk and Zk. The new parallel algorithm takes 
between 2.5 and 3 times longer to solve the Eqs. (39a)-(39e) version of a given problem as 
compared to the parallel algorithm of Ref. 6 operating on the Eqs. (38a)-(38e) version of the same 
problem. 

In order to make a sensible comparison between the two algorithms, the scaling results of 
section 2.7 must be used. The average number of unknowns at a given stage is 9 in the smaller 
problem, Eqs. (38a)-(38e), and it is 15 in the larger problem, Eqs. (39a)-(39e). One would expect 
the algorithm of this paper to execute about (15/9)2 = 2.8 times faster when solving the smaller 
problem because the wall clock time scales as n^. Thus, scaling analysis predicts that the new 
algorithm's speed on the smaller problem would be about equal to the speed achieved on that 
problem by the algorithm of Ref. 6. No comparison has been made with any of Wright's 
algorithms because they cannot solve the state-constrained aero-maneuvering problem. 

4. Conclusions 

This paper has presented an algorithm for the solution of dynamic quadratic programming 
problems on a parallel computer. The algorithm is an adaptation of the orthogonal 
factorization/null-space method to the parallel/dynamic programming framework. It takes 
advantage of the sparse "staircase" structure, and it uses domain decomposition techniques to 
achieve efficiency and parallelism. The use of a structured orthogonal factorization to determine 
the null space of the constraints ensures numerical stability of the null space determination. A 
structured block Cholesky factorization of the projected Hessian ensures numerical stability of the 
null-space optimization procedure when the projected Hessian is positive definite. 

Solution time scales as n^[(N/p) + log2(p)] for an N-stage problem on p processors with an 
average of n unknowns per stage. On a 32- node INTEL iPSC/2, the algorithm achieves solution 
times as low as 1.2 sec for a 128 stage problem with 6 state vector elements, 3 control vector 
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elements, and 6 slack variables. It is faster than the best known equivalent serial algorithm by a 
factor of 10 or more when solving large problems. 
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Abstract. An augmented Lagrangian nonlinear programming algorithm has been developed. Its 
goals are to achieve robust global convergence and fast local convergence. Several unique 
strategies help the algorithm achieve these dual goals. The algorithm consists of three nested 
loops. The outer-loop estimates the Kuhn-Tucker multipliers at a rapid linear rate of convergence. 
The middle-loop minimizes the augmented Lagrangian function for fixed multipliers. This loop 
uses the sequential quadratic programming technique with a box trust region step-size restriction. 
The inner-loop solves a single quadratic program. Slack variables and a constrained form of the 
fixed-multiplier middle-loop problem work together with curved "line" searches in the inner-loop 
problem to allow large penalty weights for rapid outer-loop convergence. The inner-loop quadratic 
programs include quadratic constraint terms, which complicate the inner-loop, but speed the 
middle-loop’s progress when the constraint curvature is large. 

The new algorithm compares favorably with a commercial sequential quadratic programming 
algorithm on five low-order test problems. Its convergence is more robust, and its speed is not 
much slower. 


Key Words. Nonlinear programming, sequential quadratic programming, augmented 
Lagrangian, trust region, constraint curvature. 


B3 


1. Introduction 


An algorithm that solves a typical nonlinear programming problem is presented in this paper. 
It is a variant of the augmented Lagrangian algorithm proposed by Hestenes (Ref. 1) and by 
Powell (Ref. 2) and described by Fletcher (Ref. 3). This paper adds several special features to the 
basic augmented Lagrangian algorithm to make it faster and more likely to converge. One such 
feature is the solution of a constrained problem that includes slack variables for each estimate of the 
multipliers. Another feature is the use of a sub-problem that has quadratic terms in the constraints 
in addition to the usual quadratic cost terms. The algorithm solves a sequence of such sub- 
problems, each subjected to box trust region constraints. 

This algorithm has been developed to serve as the core of a new nonlinear trajectory 
optimization algorithm (Ref. 4). The algorithm might be used to do real-time guidance of 
aerospace vehicles. This creates the need for a high degree of convergence reliability and speed. 


The algorithm of this paper has been designed to solve nonlinear programs of the form 


find: 

X 


(la) 

to minimize: 

J(X) 


(lb) 

subject to: 

Ci(x) = 0 

1 

II 

i 

(lc) 


Cj(x) < 0 

for i = (me+1), ..., m 

(Id) 


where x is the n-dimensional vector of quantities to be optimized, J(x) is the scalar cost function, 
and the q(x) for i = 1, ..., m are the scalar constraint functions with the first me constraints being 
equality constraints and the last mj (=m-me) constraints being inequalities. The functions J(x) and 
Ci(x) for i = 1, ..., m are assumed to be continuous and to have continuous first and second 
derivatives. 

The remainder of this paper is divided into 2 sections plus conclusions. Section 2 describes 
the algorithm, which consists of 3 nested iterative loops. Section 3 describes some test problems 
and compares the performance of the algorithm on these problems to that of NPSOL (Ref. 5). 
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2. Nonlinear Programming Algorithm Description 4 

2.1. Outer-Loop Augmented Lagrangian Algorithm. The basic outer-loop 

algorithm for applying the augmented Lagrangian method to problem (la)-(ld) is given by Fletcher 

(Ref. 3). Reference 3 also gives a proof of global convergence when a feasible point exists and 

when the global minimum of each fixed-multiplier problem can be determined. The algorithm is 

repeated with minor modifications here. The method works with guessed values for the optimal 

Kuhn-Tucker multipliers, X* for i = 1 m, and penalty parameters, pi for i = 1, ..., m, to define 

the augmented Lagrangian function: 

m e m 

Laug(*;k>U) = J(x) + t c *( x ) + “7 ) + ^ + ^ ^ 

i=l i=m e +l 

where pi > 0 for i = 1, ..., m and Xi £ 0 for i = (me+1), •••, m, where the operation [ ] + returns 
the quantity in the brackets if it is non-negative and zero otherwise, and where the vectors X p 
refer to [Xi, ..., X m ] T and [pi, ..., p m ] T , respectively. 


Outer-Loop Augmented Lagrangian Algorithm ffrom Re f. 3 with modifications!: 


1. Start with guesses Xi, and pi for i = 1, .... m. Set llc^^lloo - °°* Be sure Xi > 0 for 
i = (me+1), ..., m. Define constraint violation limits, ci max (> 0) for i = 1, ..., m. 

2. Minimize L aug (x;X,p) with respect to x to find x*(L,Q). Increase pj if necessary to 
ensure that lcj(x*)l < q max for i = 1, ..., me and that cj(x*) < c i 1 " 3 * for i = (me+1), 
..., m to keep intermediate guesses within reasonable limits. 

3. Let q = q(x*(X,p) ) for i = 1, ..., me 

and 


Ci = Ci(x*(Lp)} + — 
L Pi 


i+ 


Xi 

Pi 


for i = (me+1), ..., m. 


4 Reference 6 gives many algorithmic details that have been omitted from this section for the sake 
of brevity. 
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4. If all Icjl for i = 1, m are small enough, then terminate. 

5. If Icjl >“l , c (0ld) l , oo for any i = 1, m, then set pi = lOOpi for all such i and go 

to step 2. 

6. Setc< old > = [ci,...,c m ] T 

7. Set Xj = Xi + piCi for i = 1, m and go to step 2. 

The quantities c, for i = 1, m are, nominally, the constraint violations. 

Step 5 of the above algorithm is different from the algorithm given by Powell (Ref. 2) and 
repeated in Fletcher (Ref. 3). It raises penalty weights more rapidly to nforce a faster linear rate 
of convergence of the multiplier estimates, 1/100 rather than 1/4 as in Ref. 2. The enforcement of 
a™* limits in step 2 is an added feature that precludes the possibility that extreme descent of the 
J(x) function would cause the algorithm to diverge in a nonfeasible region. 

2.2. Transformation and Quadratic Approximation of Augmented Lagrangian 
Sub-Problem. An equivalent constrained form of the fixed-multiplier sub-problem in step 2 can 
be developed, one that remains numerically well-conditioned even as the pi — » <»: 

find: x and y (= [yi, ..., y m ] T ) ( 3a ) 

h me h m 

to minimize: J t (x,y) = J(x) + yXy> 2 + 7 { [yil + } 2 (3b) 

i=l i=m e + 1 

subject to: q(x) - yh/pi yj + (Xj/pi) = 0 fori = l, ..., m (3c) 

where Jt is the transformed cost function, the vj for i = 1, ..., m are added slack variables, and h is 
an arbitrary positive constant that allows adjustment of scaling. The slack variables are added to 
equality constraints and inequality constraints to penalize constraint violations. 

A quadratic problem (QP) that approximates problem (3a)-(3c) is 

find: Ax and Ay (4a) 
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to minimize: AJ t a) (Ax,Ay) = ^AxTH^Ax + g U)T Ax 

m e 

+y i J {yAy i 2 + V Pih Ci (j) Ayi} 

i=l 

m 

+ { [Ayi + V Pi/ h ci (j) ] + } 2 - y { [ci^]-} 2 ) (4b) 

i=m e +l 

subject to: ai®^Ax + yAx^Bi®Ax - V h/pi Ayi = 0 fori = 1, in (4c) 

- d® £ Ax < d® (4d) 

where Ax and Ay are, respectively, deviations of x and y from current guesses, x® and y®: 

Ax = x - x® 

Ayi s yj - yi® = yi - Vpi/h {ci(x®) + (Xj/pi)} fori = 1, in (5b) 

and where H® is the cost function Hessian evaluated at x®, Bj® is the Hessian of constraint i, 
g® is the cost function gradient, aj® is the gradient of constraint i, and 

q® = q(x®) + (X-i/pi) = V h/pi yi (j) for i = 1, m (6) 


The quantities d® (> 0) are box trust region bounds. The algorithm assumes that the Hessians of 
the cost and constraint functions are available either analytically or via finite-difference 
approximation. 

QP (4a)-(4c) is similar to the QP of an SQP technique in both form and function; the middle- 
loop solves a sequence of such problems in order to perform step-2 of the outer-loop. However, 
the inclusion of quadratic constraint terms in the problem, the Ax T Bj®Ax terms in Eq. (4c), is a 
significant deviation from the standard SQP technique. They make the approximate problem valid 
over a larger region, but complicate the solution procedure for Eqs. (4a)-(4d). The box trust 
region bounds in Eq. (4d) ensure that a solution to the approximate problem produces a decrease of 
the original augmented Lagrangian function. 

2.3. Middle-Loop SQP/Trust Region Algorithm for Minimizing the Fixed p/A. 
Augmented Lagrangian Function. This section defines the middle-loop algorithm to 
accomplish step 2 of the outer-loop augmented Lagrangian algorithm. 
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Middle-Loon SOP/Trust-Region Algorithm for Minimizing an Augmented Laerangian Function : 


2.1. Start with a guess of the solution to problem (3a)-(3c), x (0) , and a guess of the 
box trust region bounds, d (0) (> 0). Set j = 0. Compute L^ug = L a ug(x (0) ;A»U)- 

2.2. Calculate y®, H®, Bi®, ai®, and q® for i = 1, ..., m. 

2.3. (Approximately) solve problem (4a)-(4d) to determine Ax, Ay, and AJ[®. 

2.4. Compute L®Jf = Laug(x®+Ax;&,G), compute r = (L®^ - L a (j u ) g )/AJ t (j) , and 
compute y = 1 ^ n lAxjl/d^. 

2.5 If r £ 0.25, then set d® -1 * = 0.25 • y • d®. 

If r > 0.75 and y = 1, then set d® 1 * = min(2»d®,d (0) ). 

Otherwise, set d® 1 * = d®. 

2.6 If r 0, set x® !) = x® and L®^ = L® g . 

Otherwise, set x® 1J = x®+ Ax and 

2.7 Test termination conditions for x® 1 * and terminate successfully if they are 
satisfied. 

2.8 Set j = j + 1 . If r > 0, go to step 2.2; otherwise, go to step 2.3. 

This algorithm adopts the SQP philosophy of solving a quadratic approximation to the 
original problem to determine the next solution iterate. The use of quadratic terms in both the 
constraints and the cost makes the quadratic sub-problem a very accurate approximation of the 
problem in Eqs. (3a)-(3c) when near a local minimum of the latter problem that satisfies second- 
order sufficient conditions. Steps 2.5 and 2.6 of the algorithm implement the trust region ideas 
found in Ref. 3 (p. 96). 

2.4 Inner-Loop Solution of the Quadratically-Constrained QP 

The quadratic constraint terms make problem (4a)-(4d) much more difficult than a standard 
QP. An iterative Newton-type procedure is used to solve this sub-problem. This inner-loop 
generates a sequence of solution estimates Ax^ and associated slack variables 
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(7) 


&y■ 1 W=^[rfh {ai T AxW + yAx( k > T BjAx( k ) j fori=l,..., m 


where the superscripts (j) that appear in Eqs. (4a)-(4d) have been dropped for the sake of 

convenience. Equation (7) guarantees feasibility of Eq. (4c). 

Each successive iterate of AxOO is generated by a line search along a search direction 8x. 

Search directions are determined by solving a linear-equality -constrained QP of the form 


find 8x and 8y a 

to minimize: 8J(8x,8y a ) = ^[8x T ,8y a T ] 


’ K 0 1 [ 8x 1 

+ 

. 0 hi J LSyaJ 



subject to: 


' A « 

1 

Q 

Sx 


O' 

_ E« 

» J 

.5ya_ 


0 . 


(8a) 

(8b) 

(8c) 


where 8y a is a vector of increments to the slack variables associated with the active subset of the 

problem constraints in Eq. (4c), [A® D^)] is the Jacobian of these active constraints evaluated at 
Ax^\ E^O is the Jacobian of the active box trust region constraints, [gx^>8y a ^l ' s gradient of 


the Eq.-(4b) cost at Ax^, Ay a ^\ and H is a Hessian matrix of the form 

K = H + X P* B i (9) 

i€ Active Set 

Given a positive definite projected Hessian, the QP in Eqs. (8a)-(8c) remains well-conditioned as 
penalty weights approach infinity. The corresponding elements of D^O approach 0 in this case. 
When the projected Hessian is indefinite, the algorithm detects this condition and toggles back and 
forth in its choice of 8x between choosing a feasible direction of negative curvature and a projected 

steepest descent direction. 

The algorithm chooses one of two alternate updates for AxOO One update makes a straight- 
line search in the 8x direction: 

^ x (k+l) _ A x (k) + a 8x (10a) 


where a is the step length. The other update makes a curved search 

A x (k+l) = AxOO + a 8x + 8x(a) (10b) 

where 8x(a) corrects for the curvature in the active problem constraints. A similar correction has 
been used by Coleman and Conn (Ref. 7), Fletcher (Ref. 8 and Ref. 3, pp. 393-396), and Betts 
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and Huffman (Ref. 9). It helps the algorithm take larger steps when high penalty weights would 


limit a in Eq. (lOa)-type steps. The curvature correction is a solution of 

find 8x 

to minimize ^■8x T 8x 


subject to A® 5x + t 


(aSx^BijCabx) 

(a8x) T Bj 2 (a8x) 


= 0 


L(a8x) T Bj s (a8x) J 


(11a) 

(lib) 


(11c) 


where ii, .... i s are the indices of the active subset of the Eq.-(4c) constraints. 

The Pi multipliers used to form the Hessian in Eq. (9) are calculated in two different ways, 
depending on which type of line search is to be performed. They must make the cost in Eq. (8b) a 
v alid approximation, to second-order in a 8x, of the cost in Eq. (4b). If the straight-line search of 

Eq. (10a) is used, then the correct multipliers are 

Pi = pi q + Vpih Ayi for i = ii, ..., i s (12) 

If the curved search of Eq. ( 10b) is to be used, then the correct multiplier vector is the solution of 
find P = [Pit, ..., p is ] T (13a) 

to minimize j (A^ T P+ g x ) T (A^ T P+ g x ) ( 1 3b) 

The step length a is determined by a univariate minimization. When the straight-line step in 
Eq. (10a) is used, the cost function in Eq. (4b) is used as the step-length merit function, subject to 
the box trust region bounds in Eq. (4d). Equation (7) determines feasible values of Ay^ +1 ^ as a 
function of Ax^ k+1 ^ for use in this merit function. The merit function is piecewise -quartic in a. 
Exact univariate minimization can be performed in a finite number of operations. When the curved 
step in Eq. (10b) is used, a local approximation of the Eq.-(4b) cost is used as the merit function. 
It yields a piecewise-quadratic univariate minimization problem that also can be solved exactly [6]. 
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• *o ,nH the active set of box trust region 
U. e active set of problem constraints and the active 

constraints. , . 

2 3 2 Solve the linear-equatity-constrained QP of Eqs. <*»M*d for * 1 “““ 

direction, ISx.SyJ. Use Eqs. (13a>(13b) to demnnine die Pi mulnp tets o 

the Hessian. . , 

2 3 3 Determine a by performing a line search. Use die merit function assoc, - 

mth the curved step. Update die acdve set of problem constraints and the 

active set of box trust region constraints. 

r A „(k+D - Ax (k) + a5x+ 5x(a) and stop with 

2.3.4 Test termination condinons for ( m)_*.M + 

soludon ***> if conditions are satisfied. Odierwise, se, Ax = 
a8x and set k = k + 1 

2 3.5 Test stationary point conditions. If they are satisfied, then go stt P • ' 

2 36 If k 5 7 and AJi(Ax (k \Ay (k *) £ 0, then set Ax (W = 0, Ay =0,an 

initialize the active set of problem constraints ami the active se. of box trust 

region constraints. 

2 3 7 If k 2 15, then terminate before reaching the optimal solution. 

2 3 8 Solve the linear-equality-constrained QP of Eqs. ( 8 a)-(8c) for the search 

direction. [Sx.Sya]- Use Eqs. (12) to determine the Pi mulupl.ers or 

Hessian. , 

2 3 9 Determine a by performing a line search. Use die merit function assoc, 

with the straight-line step. Update die active se, of pmblem constramts and the 

active set of box trust region constraints. 

2.3. 10 Se, Ax‘ M > = AX« + Ct8x, se, k = k + 1 , and go to step 2.3.5. 
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Steps that drop active trust region bounds if non-optimal. 

2.3. 11 If AJ t (Ax (k) ,Ay (k) ) can be reduced by moving off of some active trust region 
bounds, then determine a feasible descent direction 5x that moves away from 
the non-optimal bounds, drop these bounds from the active set, and go to step 
2.3.9. 

2.3. 12 If AJ t (Ax (k) .Ay 00 ) > 0, then set Ax 00 = 0 , Ay (k) = 0 , re-initialize the active 
set of problem constraints and the active set of box trust region constraints, and 
go to step 2.3.8. 

2.3. 1 3 Terminate successfully. 

This algorithm should terminate after just one execution of steps 2.3.1-2.3.4 when the 
middle-loop algorithm is near its optimum. The active set will have been correctly identified, and 
the 8x direction calculated in this case will be the Newton direction for the middle-loop. Figure 1 
depicts a typical scenario for termination at step 2.3.4. 

When the algorithm enters the loop in steps 2.3.5-2.3.10, it searches for the optimum via a 
series of steps that use straight-line searches. Figure 2 depicts such a scenario. The heavy arrow 
marked a8x^ 0 ^ is the step-2.3.3 increment. The curvature correction 8x(a) is rejected in step 
2.3.4 because it does not come near enough to the optimal solution or to the constraint. The two 
increments a8x^^ and a5x^ are the search steps taken in the two subsequent iterations of steps 
2.3.5-2.3.10. The reason for trying the curvature correction 5x(a) before resorting to the straight- 
line searches aSx^ and aSx^ is that Sx(a) is relatively inexpensive to calculate. 

One danger of this technique is that a cost increase may occur during the aSx^ step. The 
approximate nature of the merit function used in step 2.3.3 could allow this to happen. Steps 
2.3.6 and 2.3.12 provide logic for recovering from such a problem. 

The inner- loop terminates after 15 iterations at step 2.3.7 even if problem (4a)-(4d) has not 
been completely solved. It is unwise to take too many inner-loop search steps per middle-loop 
search step because each inner-loop iteration involves a significant amount of linear algebra. 
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Changes to the active constraint sets can occur in two places. Problem inequality constraints, 
which are enforced via penalty terms, can be added or dropped during the line searches of steps 
2.3.3 and 2.3.9. Box trust region constraints can be added to the active set in these same steps, 
but step 2.3.1 1 is the only place where box trust region bounds can be dropped from the active set. 

2.5. Phase- 1 Constraint Satisfaction Algorithm. The convergence proof of the 
outer-loop algorithm assumes that the middle-loop can find a global minimum (Ref. 3), which is 
not guaranteed for this paper's algorithm. This limitation could cause the algorithm to converge to 
a point near an infeasible local minimum of the norm of the constraint violations. The algorithm 
tries to avoid this situation by using a Phase-I procedure to move from an arbitrary first guess to an 
almost-feasible first guess, one that does not violate any problem constraint, i, by more than 
0.01ci max . Phase-I executes the middle- and inner-loops with a multiplier guess & - 0, with very 
large penalty weights, and with the assumption that J(x) = 0. In order to make the inner-loop QP 
problem well posed, Phase-I uses = el and g® = 0 in Eq. (4b). 

2.6. Discussion of Algorithm. The algorithm has several advantages as compared to 
other nonlinear programming techniques. It has a strong likelihood of global convergence because 
the augmented Lagrangian outer-loop has good global convergence. Even with highly-curved 
constraints, the algorithm has reasonably fast global convergence because of the explicit use of 
quadratic constraint approximations in the QP sub-problem. Rapid local convergence is assured 
because the inner- and middle-loops converge quadratically under "normal" conditions, and the 
outer-loop has a very rapid linear convergence rate of .01. 

The algorithm also has several significant disadvantages. It must re-do full matrix 
factorizations for each iteration of the inner-loop. In a general context this is a serious deficiency, 
but in the intended application, trajectory optimization problems, a parallel factorization algorithm 
is available (Ref. 10), which decreases the impact of this weakness. Another disadvantage is the 
algorithm's reliance on analytic or finite-difference Hessians. The algorithm requires many more 
gradient-type evaluations per search step than a quasi-Newton method. Again, this weakness is 
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not serious in the context of trajectory optimization, where the Hessian is sparse but the quasi- 
Newton approximation is not, and where Hessian calculations are highly parallelizable (Ref. 11). 

3. Performance Evaluation on 5 Test Problems 
This section reports the results of tests of the algorithm's speed and convergence robustness, 
and it compares the algorithm with NPSOL version 4.02 (Ref. 5). Each NPSOL iteration is one in 
which it calculates a gradient, does a BFGS quasi-Newton update to the projected Hessian, solves 
a QP, and performs a line search. For comparison purposes, each iteration of the augmented 
Lagrangian algorithm's middle-loop is defined as one "iteration''. Five test problems are described 
in the appendix. Table 1 compares the performance of the two algorithms on the five problems. 

Table 1. 

Iteration Count Comparison of New Algorithm with NPSOL on 5 Test Problems. 


Problem 

NPSOL 

Middle 

Total 

New Algorithm Loops 

Middle Outer Inner 

Phase-I Total Total 

Middle 

Optimality 

1 

6 

6 

1 

3 

15 

5 

2 

Quits at 34 

73 

46 

4 

285 

27 

3 

Quits at 1 

15 

3 

5 

71 

12 

4A 

Quits 

Quits 

- 

- 

- 

- 

4B 

Quits at 7 

14 

3 

6 

89 

11 

5 

9 

13 

2 

5 

70 

11 


Obviously, the new algorithm is more robust. It converged in all but one of the cases listed, 
but NPSOL converged in only two of the cases. (Note that NPSOL's poorer robustness should 
not be construed as evidence that all SQP techniques are inherently less robust.) When NPSOL 
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does converge, it is as fast or slightly faster in terms of the number of QPs that get solved; compare 
the second and third columns in Table 1 for problems 1 and 5. For static problems on a serial 
processor, NPSOL’s speed advantage is more pronounced than indicated by the iteration counts 
because its QP sub-problems are cheaper to derive and cheaper to solve. 

A comparison of the last two columns in Table 1 shows that the average number of inner- 
loop iterations per middle-loop iteration ranges from 3 to 10.5 for the new algorithm. For each 
successful run, the early middle-loop iterations require multiple inner-loop iterations, and the latter 
middle-loop iterations each require only one inner-loop iteration, consistent with the second-order 
nature of the inner-loop. 

Variations in the choice of initial penalty weights can affect the performance of the new 
algorithm. Another solution run for problem 5 has been tried in which the initial penalty weights 
Pi are set at 100 (they are 1000 for the run listed in Table 5). This latter run is slower, requiring a 
total of 22 middle-loop iterations to solve the problem. 

4. Conclusions 

An augmented Lagrangian nonlinear programming algorithm has been developed. It consists 
of 3 nested loops. The outer- loop is an augmented Lagrangian algorithm with a first-order 
multiplier update and a rapid rate of linear convergence. The middle-loop algorithm is an SQP-type 
minimization of the augmented Lagrangian function. To alleviate ill-conditioning, the middle-loop 
problem is recast into a constrained form with slack variables that get penalized. The inner-loop 
algorithm solves a quadratically-constrained piecewise quadratic program with box trust region 
bounds. The algorithm has been compared with NPSOL on 5 low-order test problems. The 
algorithm has much more robust convergence than NPSOL, and its speed is comparable to 
NPSOL. 
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Appendix, 5 Test Problems 


to minimize: 
subject to: 


x = [XI, X2 ] t 

J(x) = - XI - X2 
x j + x^ - 1 < 0 

(xi + l) 2 + - 4 £ 0 


-2xi + X2 - 2 ^ 0 
First guess: x = [2,0] T 

Optimum: x = [.707 1,. 7071 ] T 


find: x = [xj, X2] T 

to minimize: J(x) = X2 

!•*. _ n2 . 


subject to: 


(xi - l) z + xj + 10000 (xi z + x 2 2 - l) 2 - .0625 < 0 


First guess: x = [0,.99] T 

Optimum: x = [.9688,-.2480] T 

st problem 3 [Ref. 3, p. 330]: 
find: x = [xi, X2, X3, X4] T 

to minimize: J(x) = xjX2 

, . . . (XIX3 + X2X4) 2 2 2., „ 

subject to: r 2 - x 3 - x 4 + 1 = 0 

X 1 + x 2 


- xi + X3 + 1 < 0 
-X2 + X4 + 1 < 0 

- X3 + X4 <0 

- X4 + 1 ^ 0 

Firstguess: x = [1,1,1, 1] T 


Optimum: 


x = [3.41,3.41,2.41, 1]T 
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Test problem 4: 
find: 


X = [xi, X2, X3, X4, X5, x^, X7] T 
to minimize: J(x) = y(x4 - X2XX3 - xi) 

subject to: (x^ - X 4 )(x 3 - xi) - (X 5 - xi)(x 2 - X 4 ) = 0 

X5 - X7(x4 - X2) = 0 
x^ - X7(X3 - xi) = 0 
* x 5 * x 6 +1 ^ 0 

xj + 1 <0 forj = 1,2 

- xj < 0 for j = 3,4,5, 6,7 

First guess 4A: x = [-2, -3, 6 , 6 , 6 , 6 , 6 ] T 
First guess 4B: x = [-2, -3, 6 , 6 , 6 , 6 , .7] T 
Optimum: x = [-1, -1, 2.41, 2.41, .71, .71, .21] T 

Test problem 5 : Minimum-fuel 2-bum impulsive orbit transfer from equatorial geosynchronous 
Earth orbit to 28°-inclination low Earth orbit. This problem is described in Ref. 6 . It has 8 
elements in its x decision vector, a linear cost, and 9 nonlinear inequality constraints. 
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List of Figures 


Fig. 1. Rapid termination of the inner-loop after one optimization step and one curvature- 
correction step. 


Fig. 2. Search steps taken by the inner-loop when it fails to terminate in one iteration. 
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A PARALLEL TRAJECTORY OPTIMIZATION TOOL FOR AEROSPACE PLANE GUIDANCE 
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Abstract 

A parallel trajectory optimization algorithm is being 
developed. One possible mission is to provide real-time, 
on-line guidance for the National Aerospace Plane. The 
algorithm solves a discrete-time problem via the augmented 
Lagrangian nonlinear programming algorithm. The 
algorithm exploits the dynamic programming structure of 
the problem to achieve parallelism in calculating cost 
functions, gradients, constraints, Jacobians, Hessian 
approximations, search directions, and merit functions. 
Special additions to the augmented Lagrangian algorithm 
achieve robust convergence, achieve (almost) superlinear 
local convergence, and deal with constraint curvature 
efficiently. The algorithm can handle control and state 
inequality constraints such as angle-of-attack and dynamic 
pressure constraints. Portions of the algorithm have been 
tested. The nonlinear programming core algorithm 
performs well on a variety of static test problems and on an 
orbit transfer problem. The parallel search direction 
algorithm can reduce wall clock time by a factor of 10 for 
this part of the computation task. 

L Introduction 

1.1. Objectives. 

This effort aims to develop a fast trajectory 
optimization algorithm that converges with high reliability. 
One possible application of such a tool is for real-time 
guidance of an aerospace vehicle such as an aerospace plane. 
Other applications would be to off-line trajectory planning 
or to off-line vehicle/design studies. 

The algorithm is designed to solve a fixed-time Bolza- 
type optimal control problem. Other problem types can be 
stated in this form, e.g., the minimax problem can be 
approximated by a Bolza problem, and the free-end-time 
problem can be transformed into a fixed-end-time problem. 
The problem under consideration is 

find: u( t) and x(t) for to < t < tf (la) 

tf 

to minimize: J= J L[x(t),w(t),i] dt + V / [x(tf)] (lb) 
to 
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subject to: x(to) given (lc) 

i =/[x(t),u(t),t] (Id) 

fl c [x(t),ii(t),t] = 0 (le) 

flj[x(t),*i(t),t] < 0 (If) 

flefMtf)]=0 (lg) 

n if [x(t f )]<0 (Ih) 


where x(t) and u(t) are the state and control time histories, 
respectively. Equation (lb) is a Bolza type cost, and Eq. 
(Id) defines the system's dynamics. Equations (le) and (If) 
are auxiliary constraints and may be pure control 
constraints, such as maximum angle of attack, pure state 
constraints, such as heating rate of an aerospace plane, or 
mixed control and state constraints, such as normal load 
factor. Equations (lg) and (lh) are terminal state 
constraints. 

Use For real- Time Guidance. A block diagram of how 
a trajectory optimization algorithm might be used for real- 
time control is shown in Fig. 1. It approximates «(t) by a 
sequence u(tk) u(tN_i) [tN = if]- The algorithm takes a 

sensor-based estimate of the initial state, x(tk). It uses that 
state as the initial condition in a trajectory optimization 
problem to compute the entire optimal control and state 

time histories, u(tk) tf(tN-l) and x(tk+i), ..., x(tN), and 

it sends the control for the current time, u(tk), to the 
controlled system. 

The proposed guidance system uses feedback by 
discarding its prediction of x(tk+i) at time tk+i in favor of 

the sensor-based estimate £(tk+i). Any discrepancy 
between these values forces the algorithm to re-opdmize the 
trajectory. Such a system should be more robust than 
programmed optimal control. A similar system has been 
used successfully on the Boeing Inertial Upper Stage [1]. 

Use for Off-line Studies . A faster algorithm with more 
robust convergence would also be helpful to off-line 
trajectory optimization. When used for traditional mission 
planning, increased speed and robustness would reduce the 
amount of engineering time required to do a job. The 
increased speed might make exotic design studies more 
practical. For example, robust trajectory optimization, in 
which the sensitivity of the solution to poorly known 
parameters is also constrained or penalized, could be made 
more practical. 

The present algorithm uses parallelism to achieve faster 
solution speeds. It is compatible with a new parallel 
supercomputer that NASA and a consortium of other 
agencies have purchased. This machine is a 500* node 






Fig. 1. Block Diagram of a Guidance System that uses a Trajectory Optimization Algorithm in the 

Loop. 


INTEL touchstone machine. Each node is rated at 18 
Mflops or more. Thus, this algorithm could take advantage 
of a highly parallel machine with about 9 Gflops of speed. 

1.2. Related Research. 

Many algorithms have been developed for off-line 
trajectory optimization, e.g. see Refs. 2-7. These are 
numerical gradient-based search algorithms. Generally, such 
algorithms are unsuited for on-line applications because 
they are too slow or because they do not converge reliably 
enough. 

Another approach to real-time optimization is the 
perturbation technique. It expands the problem in terms of 
a small parameter in such a way that the zeroth-order 
solution has a closed-form solution or one that is easily 
computed numerically. The higher-order terms are 
computed by solving a series of linear, nonhomogeneous 
problems [8]. Such designs may be very effective for a 
given application, but they will only work when a natural 
perturbation parameter exists and when the zeroth-order 
problem is easy to solve. Furthermore, each new 
application requires a new perturbation analysis to design 
the approximate optimal controller. 

Several researchers have explored the possibility of 
^sing parallelism to speed up numerical search trajectory 
optimization. Betts and Huffman were the first to 


implement a complete numerical trajectory optimization 
algorithm that makes use of parallelism [9]. Their work 
concentrated on parallelizing the calculation of cost 
derivatives and constraint Jacobians. Their algorithm 
realizes significant improvements until the problem size 
becomes too large, at which point their serial nonlinear 
programming algorithm becomes a significant bottleneck. 

Wright [10] and Psiaki and Park [11,12] have both 
worked on algorithms for parallelizing the linear algebra in 
the nonlinear programming portion of a trajectory 
optimization algorithm. Both approaches use a divide-and- 
conquer approach that does a stage-wise domain 
decomposition of the linear algebra. Both algorithms are 
faster than serial backwards sweep algorithms. Wright's 
algorithm is faster than that of Psiaki and Park, but it 
cannot handle auxiliary state constraints or an indefinite 
Hessian. 

The present algorithm uses the numerical search 
trajectory optimization approach. It defines a 
transformation to a discrete-time problem in order to make 
use of ideas from the field of general nonlinear 
programming (parameter optimization), but it retains the 
structure and benefits of the dynamic programming form. 
The parallel linear algebra and the augmented Lagrangian 
nonlinear programming algorithm developed in previous 
work are used [12]. Parallel gradient and Jacobian 
calculations are used, similar to Betts and Huffman [9]. 
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Parallel discrete-Newton Hessian approximation is used 
also. 

1.3. Outline of Paper. 

Section 2 describes the trajectory optimization 
algorithm. It gives transformations to a discrete-time 
problem form and then to a parameter optimization problem 
form. The section defines and discusses the nonlinear 
programming algorithm in the parameter optimization 
form. Section 2 finishes with explanations of how 
parallelism is used to speed up the calculation of 
derivatives, to perform matrix factorizations, and to evaluate 
a merit function. Section 3 describes the performance of 
sections of the code on test problems, the general nonlinear 
programming algorithm and the parallel linear algebra 
algorithm — the entire algorithm is not yet running. 
Section 4 explains the planned development sequence for 
completion of the entire algorithm, and section 5 gives 
conclusions about the algorithm s design and about the 
performance of some algorithm components. 

2. Algorithm Design. 

2.1. Transformation to Discrete Time. 

The problem in Eqs. (la)-(lh) will be solved on a 
digital computer via a numerical technique. Some form of 
problem discretization must be used simply to represent the 
solution. A number of researchers have developed 
algorithms directly for the continuous-time problem, which 
they implement approximately. Our approach is to first 
approximate the problem in Eqs. (la)-(lh) by a finite- 
dimensional numerical optimization technique. Then we 
develop an algorithm for the finite problem that can be 
exactly implemented (neglecting round-off error). Such a 
problem becomes a parameter optimization problem. 

Approximation of a continuous-time problem by a 
parameter optimization problem has several advantages. A 
parameter optimization algorithm can use many of the 
sophisticated techniques that have been developed by 
researchers in the general field of nonlinear programming. 
For example, the continuous- time problem of determining 
switching times and transversality conditions for state 
inequality constraints is conceptually simpler in discrete- 
time: it reduces to a matter of determining the active set of 
inequality constraints. With a good linear algebra package 
there is no need to differentiate the state constraints in order 
to get the controls to appear in them. 

The parameter optimization technique sometimes gets 
criticized because it can destroy problem structure and cause 
the resulting algorithm to be very slow. The particular 
form of parameter optimization problem used in this paper 
is a discrete-time optimal control problem. This preserves 
the dynamic programming structure. 

A special-purpose algorithm has been designed to solve 
the resulting nonlinear program (NP). General-purpose NP 
algorithms can be used [7,9], but they take no advantage of 
the dynamic programming problem structure. Such 


algorithms run very slowly for a large number of discrete- 
time steps 1 . The special algorithms of this paper exploit 
the sparsity and structure of the dynamic programming 
form. 

The zero-order-hold discretization of the control time 
history splits the interval [to,tf] into N intervals [tk,tk+il 
for k = 0, ..., N-l, with tN = tf. The control is modeled as 
being constant at the value Uk for each interval [tk,tk+i): 

r 9 

uk-l fortk-i<t<tk 

u k fort k <t<t k + i 

u(t) = < (2) 

iik+1 for tk+i < t < tfc +2 

Then, the problem in Eqs. (la)-(lh) may be transformed 
into the discrete-time form 


find: 

r t t i 
x= Uo, x U 

r T T T ■ 

1. x 2.-. UN-1. x N 

] (3a) 

to minimize: 




j = y L k (x k ,u k > 

+ v[xn) 

(3b) 


k-0 



subject to: 

x 0 given 


(3c) 


x k+1 = f k (x k ,ll k ) 

for k = 0 ... N-l 

(3d) 


a ek (x k ,u k ) = 0 

forks 0... N-l 

(3c) 


a i k ( x k .U k ) S 0 

for k = 0 ... N-l 

(30 


a c^( x fti) = 0 


(3g) 


a i N ( x N> * 0 


(3h) 


where the correspondence between the discrete-time control 
time history, u 0 , u v u 2 , .... u N . p and the continuous-time 

tt(t) has been defined in Eq. (2). The discrete-time state 
time history, x 0 , x v x 2 x N , corresponds to a sampling 

of the continuous-time state time history, x(t t ), x^), 
..., x(t N ). 

The difference equation in the discrete-time problem, 
Eq. (3d), models the system's dynamics. The function 

f k (x k ,u k ) on the right-hand side is defined by the solution of 
an initial value problem using the continuous-time system's 


f The terms step and stage are used interchangeably to refer 
to one discrete-time interval. 
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fferential equation, Eq. (Id). Defining x k (t) to be the 
..•ate between sample times t* and l kfl , it is the solution of 

the following initial value problem, 

x k =/Ix k (t),u k ,t) (4a) 

x k (t k ) = x k (4b) 

with the control held fixed at u k for the interval. Given that 

Eqs. (4a) and (4b) define x k (t), the discrete-time dynamics 
function is defined as 

f k ( x k> u k) = xk (*k + i) ® 

A consistent definition can also be made of the 
summand in the discrete-time cost, Eq (3b). Given the 
definition of x^t) in Eqs. (4a) and (4b), the summand is 


l k+l 

Lk(x k ,u k ) s J L[x k (t),u k ,t] dt (6) 

‘k 

Equations (4a)-(6) ensure that the discrete-time cost in 
Eq. (3b) and the discrete-time dynamics in Eq. (3d) exactly 
model their continuous-time counterparts, Eqs. (lb) and 
(Id). Equations (3b) and (3d) accurately model the effect of 
the zero-order hold on Eqs. (lb) and (Id). This modeling 
echnique is borrowed from Ref. 13. 

Note that the definitions in Eqs. (4a), (4b) and (6) 
involve integration. The computer subroutines that 
evaluate the f k (x k ,u k ) and L k (x k ,u k ) functions use numerical 

integration and user-supplied continuous-time functions, 
yix(t),«(t),t] and L[x(t),u(t),t]. Section 2.5 explains how to 
transform partial derivatives from continuous-time to 
discrete-time in a way that is consistent with these 
definitions. 

The auxiliary continuous-time constraints in the 
original continuous- time problem are sampled to produce 
the auxiliary constraints of the discrete-time problem: 

ae k (x k .u k ) = a e (x k ,u k ,t k ) for k = 0 ... N-l (7a) 

ai k (x k ,u k ) = «i(x k ,u k ,t k ) for k = 0 ... N-l (7b) 

Care must be taken so that the sample interval does not get 
too long. Otherwise, the continuous-time constraints in 
Eqs. (le) and (If) may get violated by significant amounts 
in the interval between sample times. 

The terminal constraints of the continuous-time 
problem carry over direcdy to the discrete-time problem: 

a e N (x N ) = a ef (x N ) ( & ) 

a i N (x N ) a <ijf{x N ) (8b) 


2.2. Transformation to Static Parameter 
Optimization Problem Form. 

The problem in Eqs. (3a)-(3h) is in a dynamic 
programming form. It has a special structure that can be 
used to develop an efficient algorithm. Nevertheless, this 
section transforms the problem into a general static 
parameter optimization form. This is done to simplify 
notation during the subsequent discussion of the overall 
nonlinear programming solution procedure, sections 2.3 and 
2.4. Later, in sections 2.5-2.7, the special structure will be 
discussed to show how efficiency and parallelism can be 
achieved 

In static form, the problem becomes 


find: 

X 

(9a) 

to minimize: 

J(%) 

(9b) 

subject to: 

C e (%) = 0 

(9c) 


Ci(X) < 0 

(9d) 


where the parameter vector x is defined in Eq. 3a. It is a 
long vector that contains the entire discrete-time control and 
state time histories. The cost in Eq. 9b is defined by the 
cost in Eq. 3b. 

The equality constraint in Eq. (9c) has many rows. It 
includes the state difference equations and the auxiliary 
equality constraints for stages 0 through N-l, Eqs. (3d) and 
(3e), and the terminal equality constraint, Eq. (3g): 


aeo(x 0 . u o) 
fo( x o.“o) - x i 

a e,(Xi,U,) 


c e (x) 


ae N-/ X N-l ,U N-l) 
^N-l( X N-l ,U N-l) ' X N 
— a e N ( x N) — 


( 10 ) 


The inequality constraint in Eq. (9d) also has many 
rows. It includes the auxiliary inequality constraints for 
stages 0 through N-l, Eq. (3f), and the terminal inequality 
constraint, Eq. (3h): 
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a i 0 ( x O lU o) 

aj^XpUt) 


cj(x) = 


( 11 ) 


a 'N-l( X N-l> U N-l) 


L *1^*^ -J 


2.3. An Augmented Lagrangian Nonlinear 
Programming Algorithm. 

The augmented Lagrangian algorithm is a standard 
nonlinear programming algorithm for dealing with 
nonlinear constraints [14,15]. Normally, the augmented 
Lagrangian method solves a series of unconstrained 
minimizations. Each unconstrained minimization works on 
a penalty function that includes linear and quadratic penalties 
of the constraint violations. The linear terms anse from 
guesses of the constraint multipliers, and they allow the 
algorithm to achieve exact constraint satisfaction with finite 
penalty weights on the quadratic terms. After each 
unconstrained minimization, the algorithm corrects its 
guesses of the constraint multipliers in order to improve the 
degree of constraint satisfaction. 

The augmented Lagrangian algorithm used in this paper 
formulates the problem in terms of slack variables, which 
get penalized. The slack-variable augmented Lagrangian 
algorithm solves a series of sub-problems of the form 


find: 

x, fie, and £; 

(12a) 

to minimize: 

T 

j(x) +!£&.+![&]* [vf 

(12b) 

subject to: 

C e (X)-Vh/p" Qe + (l/p)Aj = 0 

(12c) 


Ci(x)-Vh/p fii + (l/p)A* = 0 

(12d) 


where the parameter vector x, the cost function J(x), and 
the constraint functions c e (x) and Cj(x) have been defined 
in section 2.2. The vector fie contains the slack variables 
for the equality constraints, and fij contains the slack 
variables for the inequality constraints. The scalar constant 
h is arbitrary and is included to enhance numerical stability. 
The scalar p is a penalty weight. The vectors and A* 
constitute the guesses of the multipliers for the equality and 
the inequality constraints, respectively. 

The [ ] + symbol in Eq. (12b) is an operator that takes a 
vector input and produces a vector result in which all of the 
nonpositive elements have been replaced by zeros. This 


operator allows the inequality constraints’ slack variable 
penalty terms to penalize only positive constraint 
violations. 

A more traditional augmented Lagrangian algorithm 
arises when Eqs. (12c) and (12d) are used to eliminate the 
slack variables from the cost. The resulting cost function is 
called the augmented Lagrangian function: 

Jaug(x^^4)) = J(x) 

+ | |c e (%) + (l/p)A^J|c e (x) + (l/p)A^ 

T 

+ ^Ci(x) + (l/p)A*J + |ci(x) + (1/P)A* j + (13) 

The traditional augmented Lagrangian algorithm performs a 
series of unconstrained minimizations of the this function 
with respect to x. Our reasons for preferring the 
formulation in Eqs. (12a)-(12d) to this formulation centers 
around algorithm numerical stability issues for large p, 
which are discussed in Ref. 12. 

The solution to the augmented Lagrangian sub-problem 
is the solution of the original problem when the multiplier 
guesses are correct More specifically, if the multiplier 
guesses in Eqs. (12c) and (12d) are the Kuhn-Tucker 
multipliers associated with a local minimum of Eqs. (9a)- 
(9d), and if p is large enough, then a corresponding local 
minimum occurs in Eqs. (12a)-(12d). One can see this by 
comparing the first-order necessary conditions for the two 
problems. 

The multiplier guesses act as biases on the constraints 
in Eqs. (12c) and (12d). They cause the slack variables to 
be nonzero when the constraint functions are zero. In effect, 
they bias the penalties associated with the slack variables so 
that a nonzero penalty can occur when a constraint is 
exactly satisfied. This feature allows the minimum of the 
augmented Lagrangian function to occur at a value of x that 
yields zero constraint violation. This is true even for a 
finite penalty weighting, p. 

The main difficulty of the augmented Lagrangian 
algorithm is to determine the correct multipliers. The 
following algorithm has been proved to be globally 
convergent under appropriate assumptions [15]: 

Augmented Lagrangian Outer-Loop Iteration 

1. Start with guesses Ac" 1 , Ai*, and p. Set k = 0 and 

llc^lloo = oo. Be sure Ai* ^ 0. 

2. Minimize J a ug(*Ae* Ai*,p) with respect to x to find 

%(Ae*Ai*,p). 


C6 



Ce(x(Ae iAi »p)} 

3 Let c — t + 

L[c i {x(A e *Ai*.p)} +0/p)Ai*] - (i/p)Ai*_ 

4. If llclloo > Y 3 o i lc (k)|l oo set p = lOOp and go to step 2. 

5. Set k = k + 1 and = c. 

6. Set j = [~^% 1 + pc and go to step 2. 

The critical steps of this algorithm are steps 2, 3, 5, 
and 6. They form an iteration that feeds back constraint 
errors from step 3 to update multiplier estimates in step 6, 
which, in turn, affect the constraint violations that will 
result after the next minimization in step 2. This 
"feedback" has been proved to be stable for large enough p, 
and its rate of convergence is adjustable through p, which is 
why step 4 appears in the algorithm. The algorithm starts 
with initial multiplier guesses of 0 and successfully 
estimates the correct multipliers via this technique. 

The most expensive part of the algorithm is step 2. 
The algorithm completely solves the sub-problem in Eqs. 
(12a)-(12d) for each iteration of step 2. A technique akin to 
sequential quadratic programming is used to solve this sub- 
problem. Fortunately, after the first iteration of the main 
augmented Lagrangian outer-loop, a good first guess to the 
sub-problem of step 2 is usually available. In this case the 
algorithm used in step 2 requires very few iterations, 
sometimes just one. 

Another important point about step 2 is that the sub- 
problem can be ill-posed if p is not large enough. The 
problem may have an infinite minimum or it may have a 
minimum at x = Reference 12 describes a method of 
detecting difficulties and increasing p when necessary. 

An initial feasibility portion of the algorithm has been 
included as a further precaution to help ensure convergence. 
The algorithm performs an initial step-2 minimization with 
only the norm of the constraint violations for a cost 
function. The original cost function, J(x), is not 
considered. This step-2 minimization iterates until the 
maximum constraint violation is brought down below some 
user-specified tolerance. The Newton search directions 
during these iterations may be under-determined because 
there are fewer constraints than problem variables. The 
feasibility algorithm uses the minimum-norm search 
direction to resolve any such ambiguity. After this 
feasibility portion has been carried out, the algorithm 
initializes multiplier guesses at 0 and proceeds to step 1. 

2.4. Sub-Sub-Problems in the Augmented 
Lagrangian Algorithm. 

_ The second step of the outer-loop of the augmented 
Lagrangian algorithm is solved by a method that is like the 


method of successive quadratic programs (SQP). The basic 
SQP method guesses a solution and then expands the cost 
and constraints about the guessed solution in a Taylor series 
to create a quadratic program (QP) sub-problem. The cost 
gets expanded out to quadratic terms, and the constraints get 
expanded out to linear terms. This sub-problem then gets 
solved via a quadratic programming technique. 

In this paper the SQP-like technique is being applied to 
a sub-problem; therefore, each successive QP is a sub-sub- 
problem of the overall augmented Lagrangian algorithm. 
The sub-sub-problem solved here uses quadratic 
approximations of both the cost and the constraints in Eqs. 
(12a)-(12d). In addition, it includes box trust region 
bounds. The sub-sub-problem is 

find: Ax, AQe, and AQj (14a) 

to minimize: 

J jub = yAx T HAx + g T Ax 

T 

+ !{a(U + Vp/hCe} { AQ^ + Vp/h C e } 

+ j[AQ.i + V p/h Cj] [A& + V p/h cj] (14b) 

subject to: A e Ax + jAx T B e Ax - Vh/p AQe = 0 (14c) 
AjAx + jAx^BjAx - Vh/p AQj = 0 (14d) 
- Ax box — Ax — AXbox (14c) 

where Ax, AQe, and AQj are increments to the corresponding 
quantities in Eqs. (12a)-(12d). The matrix H is the Hessian 
of the cost function J(x), and g is the gradient of J(x). The 
matrices A e and A; are the Jacobians of c e (x) and c,(x), 
respectively, and the 3-index tensors B e and B; are the 
second derivatives of c e (x) and Cj(x), respectively. The 
vector c e * c e + (l/p)Ae*. and the vector c; * Cj + (l/p)Ai*- 
The vector AXbox contains the box trust region limits on 
the increments Ax. The guesses of Qe and ft are selected so 
that Eqs. (12c) and (12d) are satisfied. That is why Eqs. 
(14c) and (14d) are satisfied for Ax = AQe = AQj = 0. 

In order to solve the sub-problem in step 2 of the outer- 
loop augmented Lagrangian algorithm, a sequence of sub- 
sub-problems of the form in Eqs. (14a)-(14e) must be 
solved until Ax -> 0. For a well-posed sub-problem, this 
sequence of sub-sub-problems can be made globally 
convergent by adaptive selection of the box trust region size 

AXbox [15]. 


Cl 


The sub-sub-problem in Eqs. (14a) - (14e) is, itself, 
difficult to solve. If the quadratic terms in the constraints 
were not included, i.e., if B e = Bi = 0, then the problem 
would be a piecewise quadratic program with continuous 
first derivatives. It would be solvable in a Finite number of 
operations by a QP-type technique. The addition of the 
constraint curvature terms makes the problem piecewise 
quartic, and no algorithm exists that can exactly solve it in 
a fintie number of steps. It must get solved iteratively. 

The complication of constraint curvature has been added 
to this sub-sub-problem in order to reduce the number of 
sub-sub-problem solutions [Eqs. (14a)-(14e)] needed for a 
sub-problem solution [Eqs. ( 12a)-(12d)]. This happens 
because the sub-sub-problem provides an accurate model of 
the sub-problem for a larger range of Ax, which speeds the 
convergence rate remote from the solution. This reduces the 
number of times that first and second derivatives need to get 
calculated to set up the sub-sub-problem, which is 
important because these derivatives are computationally 
expensive. They involve numerical integration to do 
transformations from continuous-time to discrete-time, and 
the number of derivatives is large, on the order of the 
number of discrete-time steps. 

The solution procedure for the sub-sub-problem in Eqs. 
(14a)-(14e) is, itself, an iterative technique. It guesses a 
solution AXg, calculates a search direction, and does a line 
search. The line searches use a piecewise-quartic merit 
function with continuous-first derivatives. The 
discontinuous-changes in the higher derivatives occur when 
inequality constraints change from active to inactive or vice 
versa. Constraint activity changes occur when an element 

of the vector in the expression [ ] + changes sign. 

The search direction calculations for this sub-sub- 
problem involve solution of an equality-constrained 
quadratic program. This is accomplished by some matrix 
factorizations that are used to implement the null space 
method of quadratic programming [14]. First, a right QR 
factorization is performed on a constraint Jacobian matrix of 
the form 


A e - V h/p I 0 
A ia 0 - Vh/P I 

— Ebox 0 0 


( 15 ) 


The matrix A, a is the Jacobian of the active inequality 
constraints, the constraints corresponding to nonnegative 
values of A&i + Vp/h c,. The matrix Ebox corresponds to 
active box trust region constraints. For a trajectory 
optimization problem all of these matrices are large, but 
they are also sparse and structured. 

The algorithm must then determine the projected 
Hessian and factorize it. The orthogonal transformation 


from the QR factorization is used to transform the original 
Hessian matrix to determine the projected Hessian. The 
original Hessian takes the form 


’HO 0‘ 
0 hi 0 
.0 0 hi. 


(16) 


Each of the matrices will be large, and H will be block 
diagonal for a trajectory optimization problem. After 
transformation, the last factorization is a Cholesky 
factorization of the projected Hessian. 

These factorizations are used to calculate a search 
direction, and a line search is performed. The merit function 
used in the line search is the augmented Lagrangian function 
associated with the sub-sub-problem in Eqs. (14a)-(14e): 


J aug = y Ax^HAx + g T Ax 

+ J | c e + A e AX + yAX T B c Ax J 

| C e + A e AX + ^-AX T B e Ax | 

+ T 

+ A jAx + jAx T BiAxJ 

[di+ A jAx + yAx T BjAxJ 


(17) 


The foregoing discussion has been in terms of a general 
parameter optimization framework. This has been done for 
the sake of notational convenience, but the actual algorithm 
has been specialized to the dynamic programming problem 
structure in Eqs. (3a)-(3h). This is necessary for efficiency 
and parallelism. The most time-consuming parts of this 
algorithm are the calculation of derivatives to set up the 
sub-sub-problem, the factorization of matrices at each search 
step of the sub-sub-problem, and the calculation of the 
augmented Lagrangian function during line searches for the 
sub-sub-problem. The next three sub-sections explain how 
parallelism and the special dynamic programming problem 
structure can both be exploited to speed the algorithm steps 
described in this sub-section. 


2.5. Derivative Calculations. 

The sub-sub-problem in Eqs. (14a)-(14e) has various 
vectors and matrices that are first or second partial 
derivatives of functions from the problem in Eqs. (12a)- 
(12d): g, H , A e , B e , A j, and Bj. For trajectory 

optimization, the problem functions in Eqs. (12a)-(12d) are 
defined in terms of discrete-time problem functions from 
Eqs. (3a)-(3h). This sub- section shows how the discrete- 
time problem functions’ derivatives can be determined from 
the continuous-time problem functions’ derivatives, and it 
presents an efficient means of parallelizing these derivative 
calculations. The parallel derivatives scheme is almost 
identical to that developed by Betts and Huffman [9], except 
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that it operates on a different discretization of the 
>ntinuous-time problem. Their idea is extended to include 
"-second derivative calculations via finite differencing. 


The algorithm assumes that first partial 


derivatives with respect to x and u of the continuous-time 
problem functions in Eqs. (la)-(lh) are available. Second 
partial derivatives may also be available, or they may be 
estimated by Finite differencing of the first derivatives. 
Given this information, the partial derivatives of most of 


the discrete-time problem functions in Eqs. (3a)-(3h) are 
known. The only two functions whose partial derivatives 


are difficult to calculate are f k (x k ,u k ) and L k (x k ,u k ). 


Reference 13 explains how to differentiate these two 
functions with respect to their arguments. First, one 
differentiates Eqs. (4a) and (4b) with respect to the argument 
to get an initial value problem for the partial derivative of 
x^ft) with respect to the argument. Suppose one wants the 
derivative with respect to u k . Then the partial derivatives of 

Eqs. (4a) and (4b) yield 


d_ dx k (t) _ 3/ | 
dt 3u k _ ~ 3x [x k (t),u k ,t] 


+ ££| 


3ll [j^(t),u k .t] 


(18a) 


du k 


= 0 


(18b) 


''"’which is a nonhomogeneous linear time-varying matrix 
initial value problem for 3x k (t)/3u k , the control 
effectiveness matrix. To finish determining the partial 
derivative of f k (x k ,u k ), one differentiates Eq. (5): 


it , , 19t 

3u k 3u k 

The partial derivative of Eq. (6) finishes the calculation for 
Lk(x k ,u k ): 


3u k 



First partial derivatives with respect to x k are calculated in a 
similar manner. 


If Eqs. (4a), (6), (18a), and (20) are all integrated 
simultaneously with a Runge-Kutta algorithm, then 
truncation errors enter the calculations in a consistent way. 
This translates into the following fact: the derivatives 
calculated numerically in Eqs. (19) and (20) are the exact 
derivatives, neglecting round-off error, of the numerical- 


integration approximations of the functions in Eqs. (5) and 
(6). In other words, the computed derivatives are consistent 
with the computed functions. This is important to 
numerical optimization. It guarantees that, in the 
computer, the sub-sub-problem in Eqs. (14a)-(14e) is 
locally an accurate approximation of the sub-problem in 
Eqs. (12a)-(12d). 

Second derivatives are calculated in a similar way [131. 
Equations (18a)-(20) (or their 3/3x k equivalents) are 

differentiated once more. This results in a nonhomogeneous 
linear tensor initial value problem, which can be solved 
numerically. 

Parallel Differentiation . For each iteration in step 2 of 
the main augmented Lagrangian algorithm, a sub-sub- 
problem of the form in Eqs. (14a)-(14e) must be set up and 
solved. Sub-sub-problem set-up requires derivatives of all 
of the functions at each stage of the discrete-time problem 
[Eqs. (3a)-(3h)]. The cost gradient g in Eq. (14b) is 
composed of the gradients of L k (x k ,u k ) for k = 0, .... N-l 
and the gradient of V(x N ). The cost Hessian H in Eq. (14b) 
is composed of the Hessians of L k (x k ,u k ) for k = 0, .... N-l 
and the Hessian of V(x N ). The equality constraints’ 
Jacobian matrix A e in Eq. (14c) is composed of identity 
matrices, the Jacobians of f k (x k ,u k ) and a ek (x k ,u k ) for k = 0, 
..., N-l, and the Jacobian of a eN (x N ). The equality 
constraints' second derivative tensor B e in Eq. (14c) is 
composed of the second derivative tensors of f k (x k ,u k ) and 

a ek (x k ,u k ) for k = 0 N-l, and the second derivative 

tensor of a eN (x N ). The inequality constraints' Jacobian 
matrix A, in Eq. (14d) is composed of the Jacobians of 
3 i k (x k ,u k ) for k = 0, ..., N-l, and the Jacobian of a^(x N ). 
The inequality constraints’ second derivative tensor B; in Eq. 
(14d) is composed of the second derivative tensors of 
a ik (x k ,u k ) for k = 0, .... N-l, and the second derivative 

tensor of a iN (x N ). 

Thus, first and second partial derivatives of the 
following functions must be calculated for each stage: 
L k (x k ,u k ), f k (x k ,u k ), a ek (x k ,u k ), and a ik (x k ,u k ). This is a 

time-consuming operation because of the numerical 
integration required to transform from continuous-time to 
discrete- time and because of the numerical differentiation 
required to calculate second derivatives. 

The derivative calculations that get carried out at one 
stage are independent of the derivative calculations that get 
carried out at any other stage. Therefore, these operations 
can be done in parallel for different stages. If the number of 
stages equals the number of processors, then each stage s 
derivatives can be computed on a separate processor. If the 
number of processors is fewer than the number of stages, 
then each processor can compute the derivatives for several 
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stages. A speed-up factor of p can be obtained on p 
processors for this part of the algorithm if the number of 
problem stages, N, is an integer multiple of p and if each 
stage requires an identical amount of derivative calculation 
time. 

No message passing needs to occur during this process; 
therefore, no bottlenecks can occur. Nevertheless, one must 
be careful about which processors calculate derivatives for 
which stages. The results of the derivative calculations are 
used by a special parallel matrix factorization algorithm. 
That algorithm requires that the derivative information be 
distributed over the nodes of a binary tree in a particular way 
in order to expedite its message passing. The parallel 
factorization algorithm is described in Ref. 12. 

For a 24 stage problem (N = 23) running on an 8-node 
processor each processor could calculate the derivatives for 3 
stages. Figure 2 shows the 8 processors with a binary tree 
connection topology. The numbers next to the processor 
nodes show a distribution of problem stages for parallel 
derivative calculation. This distribution is consistent with 
the factorization algorithm of Ref 12. 

0,1,2 



Fig 2. Node Locations of Derivative 

Calculations for Each Stage of a 24-Stage 
Problem Run on 8 Processors. 

2.6. Parallel Linear Algebra. 

The current algorithm uses matrix factorization to 
determine a search direction. This is equivalent to solving 
the linearized state/adjoint boundary value problem that 
arises in calculus-of-variations-based numerical trajectory 
optimization algorithms [11]. The matrix 
factorizations/linear algebra associated with Eqs. (15) and 
(16) can be time consuming because the matrices are large. 
Furthermore, these factorizations may get carried out one 
hundred or more times during one trajectory optimization; 
one complete factorization is performed for each search step 
in a sub-sub-problem. Therefore, the algorithm makes use 
of parallelism and the special dynamic programming 
problem structure to expedite these factorizations. 

All of the matrices in Eqs. (15) and (16) are sparse 
(have many zero entries), and their nonzero elements fall 
into a block structure. Their structures are 


rx 


H = 


X 


X 
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A e 


0 

X 

XX 0 

X 

X X 
X 



Aj 
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0 

X 
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rx 


Ebox “ 


X 


X 


0 


0 


L 0 


( 21 ) 


( 22 ) 


(23) 


(24) 


where the X's indicate nonzero blocks. The blocks are not 
necessarily square, except for the blocks of the H matrix. 
Each column of the above block structures corresponds to a 
different discrete-time stage. The A e matrix is the only one 
with coupling between the stages. This is known as the 
"staircase" structure of dynamic programing, and it arises 
from the linearizations of the difference equations [Eq. (3d)]. 

The special matrix factorizations used by this algorithm 
are described in Ref. 12, and related factorizations are 
described in Ref. 1 1 . The factorizations amount to a special 
domain-decomposition* of the null space quadratic 
programming technique. They can factorize the matrices in 
wail clock time that is order 0{n 3 [(N/p) + Log2p]} for an 
N -stage problem running on p processors with a block size 
n. The factorizations have additional beneficial 
properties. They can check the 2nd-order sufficiency 
conditions for optimality, they can detect directions of 
negative curvature, and they can ensure that the search 
direction is a descent direction for the augmented Lagrangian 
function. 

2.7. Parallel Constraint and Merit Function 
Evaluations. 


The functions c e (%) and Cj(%) must be evaluated to set 
up the sub-sub-problem in Eqs. (14a)-(14e). Furthermore, 


* Domain decomposition is a parallelization technique that 
divides the problem into smaller problems, solves them, 
and then aggregates the problems. 
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augmented Lagrangian functions must be evaluated at two 
oints in the algorithm. The augmented Lagrangian 
— i unction in Eq. (17) is used as a merit function during the 
line searches that seek a solution to the sub-sub-problem in 
Eqs. (14a)-(14e). The augmented Lagrangian function in 
Eq. (13) is used as a merit function to evaluate and adapt the 
L\ trust region size. Some of these function evaluations 

involve numerical integration. All involve a number of 
operations that goes as the number of discrete-time problem 
stages. 

These operations can be expedited by a stage-wise 
splitting of the evaluation of components. The calculation 
of c e (x) and Ci(x) parallelizes in the same manner as the 
derivative calculations parallelize (see section 2.5). 
Evaluation of f k (x k ,u k ), a ek (x k ,u k ), and a ik (x k ,u k ) for all of 

the stages constitutes evaluation of c e (x) and Cj(x) [see 
Eqs. (10) and (1 1)], and these functions can be evaluated for 
different stages simultaneously on different processors. 

The augmented Lagrangian function in Eq. (13) can be 
rewritten as a sum over all of the discrete-time problem 
stages of a stage-wise augmented Lagrangian function. 

* * N 

Jaugfc^iAi »p) = ^Jaugfc (25a) 

k-0 


*\T 


where 

Jaugfc = ^k( x k»^k) 

+ ?lV x k .U k ) - X k< ., + (l/p)i k 

f k (x k .u k ) - x k+1 + (l/p)k k 
p / * \T 

+ jj ae k (x k ,u k ) + (l/p)Ue k 




a e k ( x k .U k ) + (l/p)Ue k 

j\»i k (x k ,u k ) + (l/p)Ui k 

* 

a i k (x k .u k ) + (l/p)Ui k 


* \T 


^augN “ ^(*n) 

+ + (^P)Ue]^ 

r * 

a e N (x N ) + (l/p)Ue N 


for k = 0, N-l 
(25b) 


+ 2 [ ai N (x N> + < 1 / P>^n] 

^ a i N (* N ) + <l/P)Ui N J (&) 

Equations (10) and (11) have been used to replace c c (x) and 
Ci (x) by their discrete-time problem component functions. 
The multiplier guess vectors hf* &1*, 2*2* > - and 
Uci*»Uc 2 *» — ^ die components of the large equality- 
constraint multiplier guess vector Ae*» and the multiplier 
guess vectors Ui 0 *. Ui 2 *. ... are the components of 

the large inequality-constraint multiplier guess vector Ai** 

A similar breakdown can be made for the discrete-time form 
of Eq. (17). 

The parallel computation of the augmented Lagrangian 
function in Eq. (25a) first computes the component 
functions, Jaugk* m parallel. Each component function can 

be computed on a separate node, or if there are fewer nodes 
than problem stages, each node can compute the augmented 
Lagrangian components for several stages. Figure 2 shows 
a sensible distribution of component function calculations 
when solving a 24-stage problem in 8 processors. The only 
message passing that will be required during this first part 
of the calculation will be to send the value of x k+1 to the 

node that is calculating Jaug^ for those values of k where 

this is necessary (for k = 2, 5, 8, 11, 14, 17, and 20 on Fig. 
2). The computation of the component functions is the 
most time-consuming part of the calculation, and it is 
completely parallelizable. 

The augmented Lagrangian calculation finishes by 
summing the component functions in what is called a "fan- 
in" on the binary tree. This can be illustrated in Figure 2 
for the 24-stage problem. First, each node sums the 
component functions for its three stages. Next, the 4 nodes 
at the bottom of the figure send their sums to the node 
above them in the tree. The four receiving nodes then sum 
their 3 -stage result with the received 3 -stage result to 
produce a 6-stage result The process then repeats itself; the 
2 nodes that are one row up from the bottom of the figure 
send their sums to the node above them in the tree. This 
process leaves some nodes idle for a fraction of the time, 
but it constitutes a very small fraction of the time required 
to calculate the augmented Lagrangian function. Therefore, 
the augmented Lagrangian calculation is almost 100% 
parallelizable. 

During solution of the sub-sub-problem in Eqs. (14a)- 
(14e) the algorithm does line searches to a univariate 
minimum of the function in Eq. (17). This function is 
piecewise quartic, and the univariate minimization procedure 
needs to know the exact quartic function for a given line 
search direction and interval of search length. This 
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calculation can be performed in a manner similar to the 
calculation of the augmented Lagrangian function. In this 
case, there is a binary fan-in to calculate each of the 5 terms 
in the 4th-order polynomial. 

The solution procedure for the sub-sub-problem in Eqs. 
(14a)-(14e) also calculates quadratic approximations of 

fk( x k’ u k)> a Ck (x k ,u k ), and a ik (x k ,u k ). Parallel evaluation of 

the quadratic approximations is performed in the same way 
as parallel evaluation of the functions themselves (see 
above). The solution procedure also needs to know when 
the quadratic approximation of ai k (x k ,u k ) + (l/pJULik* 

changes sign during a line search. This signals a 
discontinuity in the second derivative of the piecewise 
quartic merit function and triggers calculation of new 
coefficients of a 4th-order polynomial. The solutions of a 
quadratic equation give the search step lengths at which the 
inequality constraints change activity. These quadratic 
equation solutions can be distributed over different 
processors in the same stage-wise manner as depicted in 
Fig. 2. 

2.8. Review and Unification of Algorithm 
Parallelism. 

Sections 2.S-2.7 explain how to parallelize portions of 
the trajectory optimization algorithm presented in sections 
2.3 and 2.4. The parallel calculations have a recurrent 
theme: split the calculations by stages. Figure 2 shows 
how a 24-stage problem can be mapped onto 8 processors 
for parallelization of the derivative calculations, the linear 
algebra, the function evaluations, and the merit function 
evaluation. 

Information for a given stage resides on a single node, 
as in Fig. 2, in the over-all algorithm. That node stores the 

current guesses of x k , u k , 2^*, Ue k *, and ua k *. It evaluates 
the functions f k (x k ,u k ), L k (x k ,u k ), a^ (x k ,u k ), a ik (x k ,u k ), and 
J augk . It stores the search directions Ax k and Au k , and it 

updates the soludon guesses of x k , u k , 2kk*» lie**. ami iii k *. 

Each node communicates with other nodes only during 
certain operations: matrix factorizations to determine the 
search step, merit function calculations to check the box 
trust region size, and line search calculations in the sub-sub- 
problem of Eqs. (14a)-(14e). The algorithm uses one 
’'master" node. The master node does things like determine 
the final line search length, and it synchronizes the other 
processors so that each major and minor iteration of the 
algorithm happens simultaneously on all processors. The 
master node communicates with the other nodes in a "fan- 
out" along the binary tree. In the example of Fig. 2, the 
master node is the node at the top of the figure. 

3. Performance of Algorithm Components. 

The algorithm described in section 2 is still under 
development. Two key components have been developed 


and tested. One component is the augmented Lagrangian 
nonlinear programming algorithm described in sections 2.3 
and 2.4. The other component is the algorithm that does 
parallel matrix factorizations to solve a QP for the search 
direction, which is briefly described in section 2.6 and fully 
described in Ref. 12. The performance of these components 
is described in this section. 

3.1. Augmented Lagrangian Algorithm 
Performance in Test Problems. 

A serial parameter optimization form of the augmented 
Lagrangian algorithm has been encoded and tested. It has 
been encoded in matlab and tested on an IBM PC- AT. This 
section compares its performance to that of NPSOL version 

4.02. Two questions are to be answered: is the algorithm 
more likely to converge than NPSOL, and does the 
algorithm converge in fewer iterations than NPSOL? 

Each NPSOL iteration is one in which it calculates a 
gradient, does a BFGS quasi-Newton update to the projected 
Hessian, solves a QP, and performs a line search. Each 
augmented Lagrangian iteration is one in which it calculates 
derivatives and solves the sub-sub-problem in Eqs. (14a)- 
(14e). For dense parameter optimization on a serial 
processor, an augmented Lagrangian iteration is more 
expensive than an NPSOL iteration. The augmented 
Lagrangian requires more gradients per iteration (to do a 
discrete-Newton Hessian evaluation), it does multiple 
matrix factorizations per iteration, and it does multiple line 
searches per iteration. Nevertheless, iterations of the two 
algorithms are compared on a one-to-one basis because the 
augmented Lagrangian algorithm s iterations will be greatly 
sped up when put into the trajectory optimization/parallel 
processor framework. 

The performance of the two algorithms is discussed on 
a problem-by-problem basis for 5 different test problems. 

Test problem 1 : 


find: xi, X 2 

(26a) 

to minimize: J = - X] - X 2 

(26b) 

2 2 

subject to: Xj + x 2 - 1 ^ 0 

(26c) 

(xi + l) 2 + x^ - 4 < 0 

(26d) 

-2xi + X 2 - 2 < 0 

(26e) 

Both algorithms have been started from the same first 
guess: (xi, X 2 ) = (2,0). This is an infeasible first guess, 
and the Jacobians of constraints (26c) and (26d) are 


degenerate at this point: they have linearly dependent rows. 
The optimum is (.7071, .7071). NPSOL reaches it in 6 
iterations, and the present algorithm reaches it in 1 
feasibility and 5 optimality iterations. The two algorithms 
are about equally fast on this problem. 

Test problem 2 : 
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(29h) 

(29i) 


nd: xi, X2 (27a) 

to minimize: J = - X2 (27b) 

subject to: 

(xi - l) 2 + X2 + 10000 (Xj + x\ - l) 2 - .0625 < 0 (27c) 

Both algorithms have been started from the same first 
guess: (xi, X2) = (0,.99). This is an infeasible first guess. 
The constraint function has a deep curved valley whose 
bottom also has a gentle slope. Only a narrow section at 
the very bottom of a part of the valley is feasible. The first 
guess is slightly up the wall of the valley and remote from 
the feasible part. The optimum is (.9688,-.2480). NPSOL 
quits without reaching the solution. It is not able to 
achieve feasibility in 34 iterations, and it stops making 
progress. The present algorithm reaches the solution after 
52 feasibility and 25 optimality iterations. This test 
problem was designed to have particularly difficult 
constraint curvature. The curvature was enough to cause 
NPSOL to fail. The present algorithm was slowed down. 


but it eventually reached the optimum. 

Test problem 3: find the minimum-area right triangle 

that circumscribes a given circle [15]. 
find: xi,X 2 , X 3 , X 4 

(28a) 

to minimize: J = xjX2 

(28b) 

(X1X3 + x 2 X4) 2 2 2 , A 

subject to: 2 2 - x 3 - x 4 + 1 = U 

(28c) 

X 1 + x 2 

- X] + X3 + 1 < 0 

(28d) 

- X2 + X4 + 1 < 0 

(28e) 

- X3 + X4 <0 

(280 

- X4 + 1 < 0 

(28g) 


Both algorithms have been started from the same first 
guess: (xi, X2, X3, X4) = (1,1, 1,1). This is an infeasible 
first guess. The optimum is (3.41,3.41,2.41,1). NPSOL 
quits without reaching the solution. The present algorithm 
reaches it after 3 feasibility and 16 optimality iterations. 

Test problem 4: find the minimum-area right triangle 
that circumscribes a given circle. Problem modeling differs 


from test problem 3. 

find: xi, X2» X3, X4, X5, X6, X7 (29a) 

to minimize: J = - *2)( x 3 - x l) (29b) 

subject to: (X4-X6XX3-X1) - (X5-X1XX2-X4) = 0 (29c) 

x$ - X7(X4 - X2) = 0 (29d) 

X6 - X7(x3 - xi) = 0 (29e) 

- X5 - x] + 1 < 0 (290 

xi + 1 < 0 (29g) 


x 2 + 1 < 0 

- xj < 0 for j = 3,4 ,5,6,7 

Both algorithms have been started from the same first 
guess: (xi, x 2 , X3, X4, X5, X6, X7) = (-2, - 3 , 6, 6, 6, 6, 6). 
This is an infeasible first guess. The optimum is (-1, -1, 
2.41, 2.41, .71, .71, .21). NPSOL quits without reaching 
the solution. The present algorithm reaches it after 4 
feasibility and 9 optimality iterations. 

Tp.st problem 5: find the minimum-fuel 2-burn 
impulsive orbit transfer to go from equatorial 
geosynchronous Earth orbit (GEO) to low Earth orbit with 
a 28° inclination (LEO). This could be used by a space taxi 
to plan a rendezvous with the National Aerospace Plane in 
LEO after having ferried an astronaut out to GEO to fix a 
communications satellite. 


find: mo, AV 1 , Av, AV 2 (30a) 

to minimize: J = mo (30b) 

subject to: Newton’s laws for a spherical Earth (30c) 

Finite specific impulse of fuel (30d) 

ILEO - e r < Tf < rLEO + £r (30e) 

Vcirc. - e v S V f < V circ . + e v (30f> 
- Ey < Yf < +Ey (30g) 

28 ° - ej < if < 28 ° + e; (30h) 

""empty ^ m f <300 


where the usual terminal equality constraints on the final 
orbital elements have been replaced by the upper and lower 
bounds on the terminal quantities rf (geocentric radius), Vf 

(inertial speed) , Yf (flight path angle), and if (inclination) in 
Eqs. (30e)-(30h). This has been done to make the problem 
harder. The quantities mo and mf are the initial and 
terminal masses; their difference is the mass of the fuel 
burned during the two impulses. The quantities AV 1 and 
AV 2 are the vector velocity changes during the two 
impulses. The angle Av is the coast arc length between the 
bums. 

The effects of the decision quantities in Eq. 30a on the 
terminal quantities in Eqs. (30e)-(30h) are calculated by 
numerical integration of 6 coupled scalar equations of 
motion of the system. These equations can be found in 
Ref. 16. The formula for the mass change due to impulsive 
thrusting is 

r- IIAVill - 1IAV 2 II1 

mf = mo expl y" J P 1 ) 

where Vf is the nozzle exit velocity of the fuel. Note that 
Vf = gl, where g is the acceleration of gravity at the Earth’s 
surface and I is the specific impulse of the fuel. 

Both algorithms have been started from the same first 
guess, which satisfies all of the constraints except Eq. 
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(30i). NPSOL reaches the solution in 9 iterations. The 
present algorithm solves the problem in 6 feasibility 
iterations followed by 22 optimality iterations. It is about 
3 times slower than NPSOL. It gets near the solution 
fairly quickly, in about 6 feasibility iterations and 9 
optimality iterations, but it takes time to converge to the 
correct multipliers and to determine a high enough value of 
the penalty parameter, p. 

NPSOL is about 3 times faster on this problem for the 
given first guess, but it has trouble for a different first 
guess. If the first guess value of mo is increased to make 
the first guess entirely feasible, then NPSOL fails to solve 
the problem, which seems counter-intuitive. Failure occurs 
because the cost has no curvature of its own [Eq. (30b)]. 
When none of the constraints are active, the Lagrangian 
function also has no curvature; its Hessian is zero. NPSOL 
attempts to do a BFGS quasi-Newton update of a Hessian 
approximation. This results in a divide-by-zero, and the 
algorithm fails. Thus, NPSOL cannot handle linear 
programs. The algorithm presented in this paper has no 
such problems. 

The forgoing results indicate that NPSOL is more 
likely to have convergence problems than the algorithm of 
this paper. We believe that the main reason for this is 
because our algorithm first achieves feasibility. NPSOL 
could be modified to do the same. It would probably be 
slower, but its convergence would probably be more robust. 
Another problem with NPSOL could be its merit function. 
NPSOL is an SQP procedure. It is difficult to design a 
merit function to ensure global convergence for an SQP 
procedure when applying it to problems with nonlinear 
constraints. This may explain NPSOL’s failure on test 
problem 2. 

The present algorithm can also fail to converge. This 
will happen if, during the feasibility phase, it reaches a 
nonzero local minimum of the norm of the constraint 
violations. Most algorithms would have difficulty in such 
a situation. The way to avoid such situauons is to make a 
more reasonable first guess. Of course, a better first guess 
would help NPSOL as well, but NPSOL's ability to 
converge seems to be more sensitive to the choice of first 
guess. 

3.2. Performance of Parallel Algorithm for 
Matrix Factorizations and QP Solution. 

The algorithm that performs parallel calculation of 
search directions has been described in Ref. 12, where it is 
tested on a problem that is like test problem 5 in section 
3.1 above. The test problem in Ref. 12 performs the same 
orbit transfer, but it uses aeromaneuvenng and three 
impulsive burns. A dynamic programming problem form 
is set up similar to the form described in Eqs. (3a)-(3h). 
The problem has 6 state vector elements and 3 or 4 control 
vector elements at each stage. The search direction 
algorithm has been tested on a linear-quadratic expansion of 
the nonlinear problem about a guessed solution. Figure 3, 


borrowed from Ref. 12, shows how the wall-clock time of 
the algorithm varies with the number of processors and the 
number of problem stages. It also shows results for a 
backwards-sweep serial algorithm that solves the same 
problem. The parallel algorithm executes 10 time faster 
than the serial algorithm when solving a 128-stage problem 
on 32 nodes of an INTEL iPSC/2 hypercube. 



Fig. 3. Wall Clock Time to Solve for the 
Search Direction as it Varies with Problem Size 
and with Number of Processors. 


4. Planned Developments (the LORD willing!). 

The full algorithm is being developed to run on the 
INTEL iPSC/2 hypercube. Currently, the serial code for 
augmented Lagrangian parameter optimization is being 
translated from matlab code into FORTRAN code. We plan 
first to implement the algorithm serially on one processor 
of the hypercube. Afterwards we plan to replace the time- 
consuming serial parts with the parallel algorithms 
described in sections 2.5-2.7. Code that is borrowed from 
Ref. 13 will be used, with minor modifications, to do the 
parallel derivative calculations. The parallel linear search 
solver is already running. The only part of the parallel code 
that needs significant development is the parallel augmented 
Lagrangian calculation portion, described in section 2.7. 

The current plan is to develop the code to work with 
generic problem function forms. This would allow a user 
to rapidly change system models, cost functions, and 
constraints. The specification of these functions would be 
through user-written subroutines. The hope is that this 
code will be accessible to a wide variety of applications 
from different disciplines. 



5. Conclusions 

A nonlinear trajectory optimization algorithm has been 
developed. It has been formulated to run on a message- 
passing parallel processor. Special care has been taken to 
ensure convergence robustness. The algorithm solves a 
discrete-time trajectory optimization problem, which results 
from a continuous-time problem after a zero-order-hold 
discretization of the control time history. Numerical 
integration is used to complete the continuous-time to 
discrete-time transformation. It has been designed to handle 
both state and control constraints. The entire algorithm has 
not, as yet, been fully encoded, but the most complex 
portions have been encoded and tested. 

The algorithm uses the nonlinear programming 
technique known as the augmented Lagrangian algorithm. 
The discrete-time problem form can be directly solved by 
the augmented Lagrangian algorithm. The particular 
augmented Lagrangian algorithm of this paper has been 
specially tailored to increase the terminal convergence speed 
and to handle extreme constraint curvature. 

A serial parameter optimization form of the algorithm 
compares favorably with NPSOL version 4.02. It has a 
higher likelihood of converging from a given initial guess 
than NPSOL: NPSOL failed to work on more than half of 
the test problems, but the present algorithm worked on all 
of them. The present algorithm converges as quickly as 
NPSOL on one test problem, but has run as much as 3 
times slower on another. 

The algorithm has been designed to parallelize its most 
time-consuming parts. Function and derivative 
calculations, search direction calculations, and merit 
function calculations can all be parallelized on a stage-wise 
basis. Information about a particular problem stage always 
resides on a single node. The solutions for stages on 
different nodes affect each other only in two ways: through a 
special parallel routine that calculates a global search 
direction and through the limit to the search length as 
determined by a global ment function. The global search 
direction routine operating on 32 processors has achieved a 
speed-up factor as large as 10 over the best known serial 
algorithm. The other parallel portions should have speed-up 
factors approaching the number of processors. 
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Appendix D 


Specifications for problem model encoding 
Example code for the minimum-time proglem. 
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User-Defined Input Arguments to the Host Machine M.ulti-Processor 
Xrajectory Optimization program (MTOP) 


Note: All arg um ents must be dimensioned and typed in the FORTRAN 77 calling program 
exacdy as dimensioned and typed in the definitions below. 

1. Problem size/dimension specifications: 

N The INTEGERM SCALAR stage number of the terminal 

stage. The initial stage is number 0; so, there are N+l stages 
in total. 

NUVEC(0:N) The INTEGER*4 ARRAY that specifies the number of 

controls at each stage (e.g., NUVEC(K) is the number of 
controls at stage K). It is permissible for this number to be 
zero for some stages. 

NXVEC(0:N) The INTEGER*4 ARRAY that specifies the number of 

states at each stage (e.g., NXVEC(K) is the number of states 
at stage K). It is not permissible for this number to be zero 
for any stage except stage 0. Such a problem would better 
be solved as two seperate trajectory optimization problems. 
If NXVEC(K) * NXVEC(K+1), then the dynamic transition 
from stage K to stage K+l must be defined by an algebraic 
discrete-time function (see the discussion of the FLFNCT 
function in section 5). An algebraic discrete-time state 
transition function will be signalled by setting EDT(K)=1, as 
discussed in section 3. 
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NECVEC(0:N) 

NICVEC(0:N) 

NUMAX 

NXMAX 

NCMAX 

2. Initial guess of solution: 
U0(1:NUMAX,0:N) 


The INTEGER*4 ARRAY that specifies the number of 
auxiliary equality constraints of the form Ck(xk,Uk,tk) = 0 at 
each stage. It is permissible for this number to be zero for 
some stages. 

The INTEGERM ARRAY that specifies the number of 
auxiliary inequality constraints of the form Ck(xk,Uk4k) - 0 
at each stage. It is permissible for this number to be zero for 
some stages. 

An INTEGER*4 SCALAR that is used in dimensioning 
arrays that store control vectors. It must be true that 
NUMAX > NUVEC(K) for all K = 0 N. 

An INTEGER*4 SCALAR that is used in dimensioning 
arrays that store state vectors. It must be true that NXMAX 
> NXVEC(K) for all K = 0, .... N. 

An INTEGER* 4 SCALAR that is used in dimensioning 
arrays that store auxiliary equality and inequality constraint 
vectors. It must be true that NCMAX > 
NECVEC(K)+NICVEC(K) for all K = 0, ..., N. 


The REAL* 8 ARRAY that contains the initial guess of the 
optimal control time history. Note that U0(J,K) will be 
ignored for J > NUVEC(K). 
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X0(1:NXMAX,0:N) 


3. Problem modelling: 
IDT(0:N) 


The REAL* 8 ARRAY that contains the initial guess of the 
optimal state time history. Note that X0(J,K) will be 
ignored for J > NXVEC(K). The initial state guess need not 
satisfy the dynamic equations defined in the subroutine 
FLFNCT of section 5 of this document, nor must the initial 
guess satisfy the auxiliary constraints defined in the 
subroutine CFNCT of section 6 of this document. The 
optimization process will force the final solution trajectory to 
obey the dynamic equations and the auxiliary constraints. 

Note that X0(1:NXVEC(0),0) constitutes the state vector 
initial condition of the trajectory optimization problem. It 
will not be altered by the trajectory optimization algorithm. 


This INTEGER*4 ARRAY tells the algorithm whether the 
user-supplied FLFNCT function, defined below and in 
section 5, returns continuous-time functions that define the 
right-hand- side of the state differential equation and the cost 
integrand for stage K, or discrete-time state transition and 
cost functions for stage K. 

IDT(K) = 0 implies continuous-time stage-K dynamics and 
cost of the form: 

Dynamics: x = fjc(x,u,t) 

tk+1 

Stage cost: J Lk(x,u,t) dt 

tk 

D4 


I 



IDT(K) = 1 (or any nonzero integer) implies discrete- time 
stage-K dynamics of the form 


T(0:N) 


Dynamics: Xk+i = fk( x k. u k. l k) 

Stage cost: Lk(xk,Uk,tk) 

The algorithm ignores IDT(N), the stage N value, and 
assumes that a value of 1 applies. This lets it call FLFNCT 
to determine the terminal cost. No state transition function 
or differential equation function is needed at the terminal 
stage. 

This REAL*8 ARRAY gives the values of the problem’s 
independent variable (normally called time) at the different 
stages, i.e., X0(:,K) is the state vector initial guess for time 
T(K). For free end time problems, one must model the real 
time as a function of an artificial problem time whose end 
value is fixed. The vector T(0:N) stores the artificial 
problem time values. In this case, the rate of change of real 
time with respect to problem time will usually be a function 
of the controls and, perhaps, the states as well. The real 
time may need to get modelled as an extra state. The 
elements T(0:N) must not decrease as K increases. If a stage 
is modelled as a discrete-time process, if IDT(K) = 1 (or 
anything other than 0), then T(K+1) = T(K) is permissible. 
Otherwise, if the stage is modelled by a continuous-time 
process [IDT(K) = 0], then T(K+1) > T(K) is required. In 
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, c „ T fK+lVT(K) is the time interval used for 

this latter case, 1(>+U 


numerical integration. 

FLFNCT 

The name of a user-supplied SUBROUTINE drat calculates 
the functions f k (*,u,t) and U(x,u,t). which define the 
dynamics and cost a. stage K. This subroutine also must 
calculate the first partial derivatives of these functions wtth 
respect to the state vector, x, and the control vector, u. 
Section 5 gives detailed specifications for the correct destgn 


of this subroutine. 

CFNCT 

The name of a user-supplied SUBROUTINE that calculates 
the function Ck(x.u.t), which defines the auxiliary equal., y 
and inequality constraints a, smge K. Regardless of whether 
the stage is a continuous-time stage or a discrete tim 
these constraints are enforced only a, the beginning of the 


stage 

t n for i = 1 NECVEC(K) 

c ki (x k ,uk,tk) = 0 tori - i.— 


and 


c ki (Xk,Uk,tk) S 0 for i = [NECVEC(K)+U 

[NECVEC(K)+NICVEC(K)1 


The first NECVEC(K) elements of the c k constraint vector 

. the last NICVEC(K) elements 
are equality constraints, and the last rs 

are inequality constraints. 
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The CFNCT subroutine also must calculate the first partial 
derivatives of the Ck(x,u,t) function with respect to the state 
vector, x, and the control vector, u. Section 6 gives detailed 
specifications for the correct design of this subroutine. 


4. Algorithm control: 

KRK(1:2,0:(N-1)) This INTEGERM ARRAY contains quantities that control 

the algorithm and number of steps used for numerical 
integration of the continuous-time segments, the segments 
for which IDT(K) = 0. If KRK(1,K) = 1, then the standard 
4th-order Runge-Kutta algorithm is used on step K to do 
numerical integration. If KRK(1,K) = 2 (or any integer 
other than 1), then a 7-step, 6th-order Runge-Kutta method 
is used. In either case, KRK(2,K) gives the number of 
Runge-Kutta steps used to integrate from time T(K) to time 
T(K+1). 

ISECGR(0:N) This INTEGER* 4 ARRAY specifies whether the FLFNCT 

and CFNCT subroutines have been programmed to calculate 
analytic second derivatives. If ISECGR(K) = 0, then all of 
the required analytic second derivatives at stage K are 
calculated by the FLFNCT and CFNCT subroutines. If 
ISECGR(K) = 1, then the optimization algorithm will 
approximate these second derivatives via one-sided finite- 
differencing of the first derivatives provided by the FLFNCT 
and CFNCT subroutines. For example, an element of the 
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second derivative of f(x,u,t) with respect to x is 
approximated by 

afj 3f| 

_ gg(x+ejAxj) ~ 5g(») 
dx i^ x j " Axj 

where ej is a unit vector with a 1 on row j. If ISECGR(K) = 
2 (or any larger integer), then the optimization algorithm will 
approximate the second derivatives via central-differencing 
of FLFNCT's and CFNCT's first derivatives. For example, 
an element of the second derivative of f(x,u,t) with respect 
to x is approximated by 

3fj an 

a2f _ (x+ejAxj) ~ ^(x-ejAxj) 
axjcJxj’ 2Axj 

Additionally, symmetrization of the finite difference second 
derivatives is performed by averaging each Hessian (of a 
scalar) with its transpose. This symmetrization is performed 
in both the one-sided and central difference cases. 

Note that ISECGR(K) = 2 will execute more slowly than 
ISECRG(K) = 1, but the second derivatives will be more 
accurate. The numerical differentiation steps used are 
contained in the user-supplied arrays DELTU and DELTX. 

DELTU ( 1 :NUMAX,0: N) This REAL*8 ARRAY contains the finite-difference values 

used for calculating numerical second derivatives when 
analytic second derivatives have not been supplied. If 
ISECGR(K) * 0, then the values in 
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DELTU( 1 :NUVEC(K),K) are used as described above in 
the section on the ISECGR flag array. 

DELTX(1:NXMAX,0:N) This REAL* 8 ARRAY contains the finite-difference values 

used for calculating numerical second derivatives when 
analytic second derivatives have not been supplied. When 
ISECGR(K) * 0, then the values in 

DELTX( 1 :NXVEC(K),K) are used as described above in 
the section on the ISECGR flag array. 

5. Detailed specifications for the FLFNCT SUBROUTINE 

This is the dummy name of the subroutine that determines the dynamics and the cost 
function at each stage. The user uses this subroutine to model the problem. The user 
has the option of modelling any given stage as a continuous-time process or as a 
discrete-time process. 

If stage k is modelled as a continuous-time process, if IDT(K) = 0, then a zero-order- 
hold will be assumed for the control inputs, and FLFNCT’s output functions fk(x,u,t) 
and Lk(x,u,t) model a state differential equation and a cost integrand, respectively, for 
stage k. They are used as follows. Given 

t 

jc(t) = Xk + Jfk[x(x),uk,x] dx 

tk 

the state at the next stage and the cost for the stage are 

Xk+l = *(tk+l) 
tk-*- 1 

(Cost)k = |Lk[x(x),Uk,t] dx 

tk 
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where Uk and Xk are, respectively, the control and state vectors at stage k. 

If, on the other hand, stage k is modelled as a discrete-time process, if IDT(K) = l or 
some other nonzero integer, then FLFNCT's output functions fk(x,u,t) and Lk(x,u,t) 
define, respectively, the discrete-time state difference equation transition function and the 
discrete-time cost: 

Xk+1 = fk(xk.Uk.tk) 

(Cost)k = Lk(xk,Uk,tk) 

In this case, it is permissible that dim(xk+i) * dim(Xk) (i.e., NXVEC(K-i-l) * 
NXVEC(K)). 

Regardless of whether stage k is a continuous-time stage or a discrete-time stage, the 
FLFNCT subroutine must also calculate the first partial derivatives of fk(x,u,t) and 
Lk(x,u,t) with respect to x and u when an input flag calls for these to be calculated. 
Optionally, the user can program the function to calculate the second partial derivatives 
of fk(x,u,t) and Lk(x,u,t) with respect to x and u upon request as signalled by an input 
flag. 

If FLFNCT has not been programmed to calculate second derivatives for stage k, then 
the user must set ISECGR(K) to some other value than 0 so that the optimization 
algorithm will estimate the second derivatives via finite differencing. Also, values must 
specified for DELTU(1:NUVEC(K),K) and DELTX(1:NXVEC(K),K) if FLFNCT 
does not calculate second derivatives for stage k. 

The user is responsible for programming FLFNCT correctly so that it accurately models 
each stage k. The current stage number is provided as an input argument so that 
FLFNCT can perform the necessary branching to change models for stages that require 



such changes. Note that, at stage N, FLFNCT must not return a value for the fN(x,u,t) 
function because there is no next stage to which a transition must occur. Any fN value 
calculated for stage N will be ignored by the optimization algorithm. 

The FLFNCT program must have the following argument list and dimensioning and data 
typing information: 

SUBROUTINE FLFNCT(X,U,T,K,IFLAG,NX,NU,NF,F,L,DFDX,DLDX, 

1 DFDU,DLDU,D2FDX2,D2LDX2,D2FDXU,D2LDXU,D2FDU2, 

2 D2LDU2) 

INTEGER*4 K,IFLAG,NX,NU,NF 

REAL* 8 X(NX),U(NU),T,F(NF),L,DFDX(NF,NX),DLDX(NX), 

1 DFDU (NF,NU ) ,DLDU (NU) , D2FDX2(NF,NX,NX), 

2 D2LDX2(NX,NX),D2FDXU(NF,NX,NU),D2LDXU(NX,NU), 

3 D2FDU2(NF,NU,NU),D2LDU2(NU,NU) 

Note that all of the above arrays must be adjustable-sized arrays in order for the program 
to work properly. 

The input arguments for the subroutine are: 

X REAL* 8 ARRAY containing the state vector where X(I) = x;, the ith 

element of x. 

U REAL*8 ARRAY containing the control vector where U(I) = ui, the ith 

element of u. 

T REAL*8 SCALAR containing the problem time, t. It is expressed in 

the same units and with the same origin as the T(0:N) input array 
defined in section 3 above. 

K INTEGER *4 SCALAR specifying the current stage number. 

IFLAG INTEGER*4 SCALAR flag specifying which outputs are needed: 

IFLAG = 0: Only the functions, outputs F(:) and L, need be 

calculated on this call. 
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IFLAG = 1 : 


Only the functions and their first partial 
derivatives, outputs F(:), L, DFDX(:,:), 
DLDX(:), DFDU(:,:), and DLDU(:), need be 
calculated on this call. 

IFLAG = 2: The functions, their first partial derivatives, and 

their second partial derivatives, F(:), L, 
DFDX(:,:), DLDX(:), DFDU(:,:), DLDU(:), 
D2FDX2(:,:,:), D2LDX2(:,:), D2FDXU(:,:,:), 
D2LDXU(:,:), D2FDU2(:,:,:), and 

D2LDU2(:,:), need to be calculated on this 
call. Note that, if ISECGR(K) * 0, then the 
subroutine FLFNCT will always be called with 
IFLAG = 0 or IFLAG = 1. 

NX INTEGER*4 SCALAR used to define the size of adjustable-sized 

dummy argument arrays. 

NU INTEGER*4 SCALAR used to define the size of adjustable-sized 

dummy argument arrays. 

NF INTEGER*4 SCALAR used to define the size of adjustable- sized 

dummy argument arrays. This must be allowed to be different than 
NX because of the possibility of having explicit discrete-time steps in 
which the dimension of the state vector changes during a state 
transition. 

The output arguments for the subroutine are: 

F REAL*8 ARRAY containing the fk(x,u,t) function where F(I) = fjq, 

the ith element of fk(x,u,t). 

L REAL*8 SCALAR containing Lk(x,u,t). 
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DFDX 


REAL* 8 ARRAY containing 3f k (x,u,t)/9x where DFDX(I,J) = 
9f k /9xj. This output needs to be computed only when IFLAG 5 * 0. 
DLDX REAL* 8 ARRAY containing 9L k (x,u,t)/3x where DLDX(J) = 3L/9xj. 

This output needs to be computed only when IFLAG * 0. 

DFDU REAL* 8 ARRAY containing 3f k (x,u,t)/3u where DFDU(I,J) = 
3f ki /9uj. This output needs to be computed only when IFLAG * 0. 
DLDU REAL* 8 ARRAY containing 3L k (x,u,t)/3u where DLDU(J) = 3L/9uj. 
This output needs to be computed only when EFLAG * 0. 

The following 6 optional output arguments need not be calculated if the user specifies the 
use of finite differencing to compute second gradients - by setting ISECGR(K) * 0. In 
all cases, however, the names of these optional arguments must appear in the argument 
list. 

D2FDX2 REAL*8 ARRAY containing 9 2 f k (x,u,t)/3x 2 where D2FDX2(I,J,M) 
= 9 2 f ki /3xj3x m . This output needs to be computed only when IFLAG 

is neither 0 nor 1 . 

D2LDX2 REAL*8 ARRAY containing 3 2 L k (x,u,t)/3x 2 where D2LDX2(J,M) = 
3 2 Lk/9xj8x m . This output needs to be computed only when EFLAG is 
neither 0 nor 1 . 

D2FDXU REAL*8 ARRAY containing 3 2 f k (x,u ,t)/3x3u where 
D2FDXU(U,M) = 9 2 f ki /3xj3u m . This output needs to be computed 

only when IFLAG is neither 0 nor 1. 

D2LDXU REAL *8 ARRAY containing 3 2 L k (x,u,t)/3x3u where D2LDXU(J,M) 

= 3 2 Lk/3xj3u m . This output needs to be computed only when IFLAG 
is neither 0 nor 1 . 
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D2FDU2 REAL* 8 ARRAY containing 9 2 f k (x,u,t)/au 2 where D2FDU2(I,J,M) 

= 9 2 f ki /9uj9u m . This output needs to be computed only when IFLAG 

is neither 0 nor 1. 

D2LDU2 REAL *8 ARRAY containing 9 2 L k (x,u,t)/9u 2 where DL2DU2(J,M) = 

B^ic/dujdum. This output needs to be computed only when IFLAG is 
neither 0 nor 1 . 

WARNINGS: 

a) Never write to one of the input arguments. The optimization program that calls this 
subroutine assumes that FLFNCT does not change the value of any of the input 
arguments. 

b) Never write to an adjustable-sized output array if it has any zero dimension, NU = 
0 or NF = 0. The quantity NU will be equal to the value that the user has input to 
the main optimization routine as NUVEC(K). The quantity NF will be equal to the 
value that the user has input to the main optimization routine as NXVEC(K+1) 
except for stage N, in which case NF will be zero. As an example, if NU is zero, 
then the output arrays DFDU, DLDU, D2FDU2, D2LDU2, D2FDXU and 
D2LDXU will have no elements because one of their adjustable dimensions is zero. 
The main optimization program will not work in this case if FLFNCT tries to write 
to any of these arrays. 

c) Never write to any of the partial derivative output arrays if IFLAG has not been set 
to call for that quantity to be output. That is, do not write to DFDX, DLDX, 
DFDU, or DLDU unless IFLAG > 1, and do not write to D2FDX2, D2LDX2, 
D2FDXU, D2LDXU, D2FDU2, or D2LDU2 unless IFLAG > 2. 

d) The output value of L must be defined for every problem stage K. If stage K does 
not contribute to the cost, then the assignment statement L = 0 must be included in 
FLFNCT. Likewise, for the terminal stage where NF = 0, a value must be 



assigned to the L output argument despite the fact that no values will be assigned to 
the output array F. 

6. Specifications for the CFNCT SUBROUTINE 

This is the dummy name of the subroutine that determines the auxiliary constraints at 
each stage. The user uses this subroutine to further model the problem. The constraints 
defined in this subroutine are always enforced at the initial time of the interval. The 
constraints are specified in terms of the elements of the Ck(x,u,t) vector function: 

Ckj(Xk.Uk.tk) = 0 for i = 1, ..., NECVEC(K) 

Cki(xk,u k ,tk) <0 ‘ for i = [NECVEC(K)+1] [NECVEC(K)+NICVEC(K)] 

Note that CFNCT must be correctly programmed so that the first NECVEC(K) elements 
of Ck(x,u,t) define the equality constraint functions for stage k, and the last NICVEC(K) 
elements of Ck(x,u,t) define the inequality constraint functions for stage k. 

The CFNCT subroutine must also calculate the first partial derivatives of Ck(x,u,t) with 
respect to x and u when an input flag calls for these to be calculated. Optionally, the 
user can program the function to calculate the second partial derivatives of Ck(x,u,t) with 

respect to x and u upon request as signalled by an input flag. 

If CFNCT has not been programmed to calculate second derivatives for stage k, then the 
user must set ISECGR(K) to some other value than 0 so that the optimization algorithm 
will estimate the second derivatives via finite differencing. Also, values must specified 
for DELTU(1:NUVEC(K),K) and DELTX(1:NXVEC(K),K) if CFNCT does not 
calculate second derivatives for stage k. 

The user is responsible for programming CFNCT correctly so that it correctly models 
each stage k. There may be stages where the length of the equality part of the constraint 
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vector is zero or where the length of the inequality part of the constraint vector is zero. 
The current stage number is provided as an input argument so that CFNCT can perform 
the necessary branching to change constraints if the constraints are different at different 
stages. 

The CFNCT program must have the following argument list and dimensioning and data 
typing information: 

SUBROUTINE CFNCT(X,U,T,K,IFLAG,NX,NU,NC,C,DCDX,DCDU, 

1 D2CDX2£>2CDXU,D2CDU2) 

INTEGER*4 K,IFLAG,NX,NU,NC 

REAL* 8 X(NX),U(NU),T,C(NC),DCDX(NC,NX),DCDU(NC,NU), 

1 D2CDX2(NC,NX,NX),D2CDXU(NC,NX,NU),D2CDU2(NC,NU,NU) 

Note, all argument arrays must be adjustable- sized arrays with the dimensions as defined 
above in order for the calling program to function properly. 

The input arguments for the subroutine are: 

X REAL*8 ARRAY containing the state vector where X(I) = xi, the ith 

element of x. 

U REAL*8 ARRAY containing the control vector where U(I) = uj, the ith 

element of u. 

T REAL* 8 SCALAR containing the problem time, t. It is expressed in 

the same units and with the same origin as the T(0:N) input array 
defined in section 3 above. 

K INTEGER*4 SCALAR specifying the current stage number. 

[FLAG INTEGER*4 SCALAR flag specifying which outputs are needed: 

IFLAG = 0: Only the function C(:) needs to be calculated on 

this call. 
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IFLAG = 1: 


Only the function and its first partial 
derivatives, outputs C(:), DCDX(:,:), and 
DCDU(:,:), need be calculated on this call. 

IFLAG = 2: The function, its first partial derivatives, and its 

second partial derivatives, outputs C(:), 
DCDX(:,:), DCDU(:,:), D2CDX2(:,:,:), 
D2CDXU(:,:,:), and D2CDU2(:,:,:), need to be 
calculated on this call. Note that, if 
ISECGR(K) * 0, then the subroutine CFNCT 
will always be called with IFLAG = 0 or 
IFLAG = 1. 

NX INTEGERM SCALAR used to define the size of adjustable- si zed 

dummy argument arrays. 

NU INTEGERM SCALAR used to define the size of adjustable-sized 

dummy argument arrays. 

NC INTEGERM SCALAR used to define the size of adjustable-sized 

dummy argument arrays. 

The out put arguments for the subroutine are: 

C REAL *8 ARRAY containing the Ck(x,u,t) function where C(I) = ciq, 

the ith element of Ck(x,u,t). 

DCDX REALM ARRAY containing 9ck(x,u,t)/3x where DCDX(I,J) = 
9ciq/9xj. This output needs to be computed only when IFLAG * 0. 

DCDU REALM ARRAY containing 9ck(x,u,t)/9u where DCDU(I,J) = 
9ckj/9uj. This output needs to be computed only when IFLAG * 0. 

The following 3 optional output arguments need not be calculated if the user specifies the 
use of finite differencing to compute second gradients — by setting ISECGR(K) * 0. In 
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all cases, however, the names of these optional arguments must appear in the argument 
list. 

D2CDX2 REAL*8 ARRAY containing 9 2 Ck(x,u,t)/9x 2 where D2CDX2(IJ,M) 
= 3 2 cic/5xj9x m . This output needs to be computed only when IFLAG 

is neither 0 nor 1 . 

D2CDXU REAL*8 ARRAY containing d 2 Ck(x,u ,t)/8x9u where 
D2CDXU(I,J,M) = 3 2 Ckj/3xj3u m . This output needs to be computed 

only when IFLAG is neither 0 nor 1. 

D2CDU2 REAL*8 ARRAY containing 3 2 Ck(x,u,t)/9u 2 where D2CDU2(IJ,M) 
= 9 2 ciq/9uj3u m . This output needs to be computed only when IFLAG 

is neither 0 nor 1 . 

WARNINGS: 

a) Never write to one of the input arguments. The optimization program that calls this 
subroutine assumes that CFNCT does not change the value of any of the input 
arguments. 

b) Never write to an adjustable- sized output array if any of its dimensions is NU and 
NU = 0 for the current call of CFNCT. (CFNCT will not be called if NC is zero.) 
The quantity NU will be equal to the value that the user has input to the main 
optimization routine as NUVEC(K). The quantity NC will be equal to the sum of 
two values that the user has input to the main optimization routine: 
NECVEC(K)+NICVEC(K). If NU is zero, then the output arrays DCDU, 
D2CDXU and D2CDU2 will have no elements because one of their adjustable 
dimensions is zero. The main optimization program will not work in this case if 
CFNCT tries to write to any of these arrays. 

c) Never write to any of the partial derivative output arrays if IFLAG has not been set 
to call for that quantity to be output. That is, do not write to DCDX or DCDU 
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unless IFLAG > 1 , and do not write to D2CDX2, D2CDXU, or D2CDU2 unless 
EFLAG > 2. 
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c 

C- 


THESE SUBFTO IT ‘.MET WERE Lpr' T Ef : E" M . L. . RSI API IN MARC- 1 GC ?3 . 

SUBROUT I ME F _MNTM < X , U , T - P - l FLAG , 1 1 x , NU ? NF . F , L , DF D ; , OLD > . 

J DFDU , DLL U , DSFD> 3 * DSLD v £ % DSFD XU « DSLDXL - DBF DU? - 

S D2LDU2) 

IMPLICIT REAL* 8 < A-H.O-Z ' 

IMPLICIT I NT EGER* 4 ( I-N) 

PARAMETER <N=31 > 

REALMS L 

D I MENS ION •: < n x , , IJ < NU) , F ? NF j , 

1 DFD>: 1 NF , NX ) * DLDX < NX 1 * DFDU > NF , NU , , DLDU < NU ) - 

2 0 2 F D X 2 ( .NF » N X , N X ) . DSLD >' 2 < N X * N >.' ) , DEFD X U ( t F - r 4 x. , N U ■ . 

3 C2LBXU •; NX , NU ) , D2FDU2 ’ NF * NU , NU ) , DSL DUE *; NU * NU > 

SAVE 

NOTE THE FOLLOWING DEFINITIONS APPL f Tn STAGES i THROUGH - N- 1 

■ ! = ; 1 POE IT I Ot! 

■■ ■ E - ' -OS IT 10! i 

•' ? 1 ~ ' 1 V'ELOC I T '" 

kui =- ‘-!2 VELOCITY, 

■ ( e ” ■ = .1 ACCELERATION < USUAL L v CALLED' UU 

v - .:2 ACCELERATION (USUALLY CALLED US ' 

V:—, ,r: RATE of PASSAGE of REAL TIME SEC; WITH RESPECT 

TO PROBLEM TIME STAGES i s N ) 

l_! \ . -- PATE OF CHANGE C~ XI ACCELERATION 

U ■’ 2 ‘ ■= RATE OF CHANGE OF XE ACCELERATION 
ij s j r= RATE OF PASSAGE OF REAL TIME ■; SEC V WITH RESPECT 
T U PROBLEM TIME \ ST AGE O ONLY). 

NOTE THAT THE PROBLEM TIME HAS A TERMINAL VALUE OF 1 . 

CL STAGE " ASSUME HF = V . NX - .. AND NU = S. THIS DISCRETE-TIME STAGE 
SETT ? if-' T'-TS FIXED FOS I T ION AND VELOCITY INITIAL CONDITIONS* AND I T 
ALLOWS THE OT I Ml CAT I ON ROUTINE TO SET UP OPTIMAL INITIAL ACCELERATIONS 
AND THE TERMINAL. FROBLEM TIME. 

IF ; . EO „ 0 * r HEN 
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OF POOR QUALITY 



o n n 


p ; | j - - 7 3 . 6 O 5 3 8355 1 -+ S 7 39 D + ■- > 0 
F i p 1 - 3 5 . J. a 54 8 6 1 5 -'+ 2 1 7 4 a I) + A o 


Pj: 


DO 


10 '') 


1 lU 
1 20 


i4«: 


F- 3 2 . 

:>1S3A02S1 47D+00 

r : H ; ■- . h 30 

9 3° 3 1 3 4 7 Q 2 p. T » ■+• '■ > 

t ' ■' 5 ~ J (1 


F « o ) - i ! * 2 ! 


F' If 3- 


I r- i F i. AG . GE . 1 

^ THEM 

DO 20 I I - 

1 ,2 

DLDLK I I > 

= 0 „ D+00 

DO 11 13 

= 1,0 

DF DU . 

JJ , 11 - = 0 . D +00 

CONTINUE 


FONT IMIJE 


OF DU ‘EM = 

1 . D+00 

DFDL Lo,2 = 

1 .D + 00 

DFDU ( n ,3i = 

1 . D+00 

T - ’ I FLAG. E 

0.2) THEN 

f.n o:> II 

= 1,3 

DO • 

IT r ■ , 3 

D2LDU2 > JJ , II ) - 0 

DO 

30 MM = 1,7 


P2FDU2 ' MM , I T , J. 

3 r 

JNTINUE 

CONTI 

NUE 

CONTINUE 

END I F 


END IF 


T AGE'S i THROUGH < N 

!- i . ASSUME MF 

ELSE 1 7 ■ f . LE . f . N- 

■in then 

L - 


F ( 1 i - x ( 3 ) * * 

7 > 

F < 2 ) -- X ( 4 > * X 

( 7 ) 

F »' 3 :i X f 0 ) * X < 

7 ) 

r ; .4 1 “ >' < 6 ■> * X 

( 7 > 

F i 5 - 1. > v 1 > * X ; 

7 ) 

F i, c- 1 - U(2'>+X 

(7) 

rr ) ~ • : . D«- 0 O 


IF I FLAG . GE . 

1 ) THEN 


■ 1 , 7 

DLDv. i I J 

) = 0 . D + 00 

DO 1 70 , 

J J - 1,7 

DFDX 

*; 1 1 , j j •, =.■= o.d+ox 

CONTINUE 

Du 1 1 0 ' 

T J = 1,2 

OF DU ( I I * J J > = 0 . D+OX 

CONTINUE 

CONTINUE 


nn j tp'j t j 

;= 1 , 2 : 

DLDU ’ I I 

') = O.D+OO 

FONT I NUE 


DL D X : 7 ) = 

1 .D+00 

DFDX ■ 1 , 3 ’> 

= X ( 7 

DFDX ' 1 ,7 j 

=7 v i 3 ) 

DFDX ’ 2 , *+ > 

= X ( 7 > 

npn ■ 3 T 7 

>.: *; 4 ) 

DFDX '3,5 > 

= V: \ F ) 

DFDX i 3,-?' 

= X < 5 > 

DFDX = + ,S » 

= x ■; 7 ) 

DFDX ■ h , 0 ; 

“ X ( 6 > 

D“DU ’’5,1 i 

- X ■ 7 'i 

DFD x f ' 5 , 7 .■ 

-- IJ ' 1 ) 

"pr. t j Ci ^ r j 

- X ^ ■> 


N'x 


AND NU = 2. 


OWBgg; 

Of fO° R W 
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o n n 


L 1 F D X k <:.• , / ) — U d . : 

IF t I FLAG . EQ . 2 ) THEN 
DG 250 II = 1 ,7 

DO 240 JJ =■ 1,7 

D2LDX2 ; I 1 . J .1 - O.D+0 1 '- 

DO 230 MM - i , 

D2FD V £ ■ I I , J J . MM > =■ ■ ■ . D + 0 
230 CONTINUE 

DO 235 Mm = l , £ 

02 FD X U ( I I , J J , MM ) ~ 0 . D+00 
235 CONTINUE 

240 CONTINUE 

DO 245 J j = 1,2 

D2LDXU t I I , JJ 3 = 0 . D+00 
DO 243 MM = 1,2 

D2FDU2 f I I , ,T J , MM > = 0 . D+04 
243 CONTINUE 

245 CONTINUE 

250 CONTINUE 

DO 274 II = 1,2 

DO £60 JO ~ 1 ,2 

D2LDU2ai,JJi = O.D+OO 
260 CONTINUE 

270 CONTINUE 


D2FDX2 ( 1,3,"" 

~ 

1 . D-i-Mi.i 

D2FDX2 { 1 ,7,3 : 

- 

1 . D+00 

P2FDX2 ''2.4,7; 

- 

1 . D+0" 

D2FDX2 (2,7,4 j 

-■ 

1 . D+00 

D2FDX2 ( 3 , 5 , 7 i 

= 

J . D+00 

DHFDX2 (3,7,5) 

= 

1 . D+00 

D2FDX2 i 4 , 6 , 7 j 

= 

1 . D+00 

D2FDX2 (4,7,6) 

= 

1 . D+00 

D2FDXU (5,7, 1 ) 

= 

1 . D+OO 

D2FDXIJ (6,7 , 2 ) 

=. 

1 . D+00 


END IF 
END IF 

DO STAGE N. ASSUME NF = 0, N> = 7, AND NU = o. 


44" 


454: 


ELSE IF ' \ . ED . N "j THEN 
[. = o , D+00 
IF ;iFLAG.GE.l) THEN 
DO 320 II = 1,7 

DLDXill) » O.D+OO 
CONTINUE 

IF I FLAG. EC. 2- THEN 
DO 450 II = 1,7 
DO 440 JD ~ 1 .7 

D2LDX2-: I I , JJ t 
CONTINUE 
CONTINUE 
END IF 
END IF 
END J F 
RETURN 
END 


0 . D+00 


C 

c 

C THIS SUBROUTINE DEFINES THE AUXILIARY' CONSTRAINT- FOR 
C M I N I MUM — T T ME — TO -THE-OR I G I N PROBLEM . 

C 

SUBPOUT I NE CMNTM t , U - T , 4 , I FLAG , NX , NU , NC , C , DCD X • 

1 DCDU , D2CDX2 , D2CDXU , D2CDU2 

I MPL 1 C I T PE Ai . *P < A-H , 0-Z ■ 

I MPL. I C 1 7 I NTEGEF +4 < r -N i 
PARAMETER N=--31 - 


THE 


32 — ST AGE 
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I !--E' i£ I ON r < hi X j , J C NU > « C ■ NC •■ , 

DCDV < NC , h4 X > , DC DU * NC . NU ) , D2CDX2 ( NC , t ! ' .N? » - 
DEC D XU NC , N * , NU ) % DEC DUE ( t iC , NU , NU ) 

A-'E 

FGP -SHE CEROTH STAGE, WHICH Only SETS UP THE If! 

C THESE 1= Oi -t. ‘ A>- AUX IL I AR V : MEGUAL IT- COmSTF-'A I f 

C MAGNITUDE. ASSUME UC - E , N 4 = , AND Nil - 3 

I r , . .EC'.O > THEM 

r < j ' = - ij < 3 : + . 0 1 D+OO 

JF ■' TFLAG.CE. 1 » THE! * 

DC DU ''1,1 ■ = 2 . D+ 00*IJ ( 1 ) 

DCDU ( - ,2 < - 2 • D+OO* U ( 2 ) 

DC DU ( 1,3) ' O.D+OO 

DCDU (2,1' - 0 - D+OO 

DC DU (2,2 i = O.D+OO 
DCDU (2,3 ' = -1 * D+OO 
IF I FLAG. ED. £> THEN 
DO 30 II ~ .1,2 
jO 20 JJ - 1,3 

r.Q 10 IT- = 1,3 

DEC DUE ■ T I , J J , I T ’= = 0 D +■' 

1 -..- CONTINUE 

ZiO CONTINUE 

30 CONTINUE 

DC CD LIE ( 1,1,1 ~ E - D + OO 

D2CDU2 ( 1 , E , E > -- 2 . D+OO 
END IF 
END IF 
C 

C on STAGE-. 1 THROUGH 'N-l). ASSUME NC = E, NX = ~ - AND NU = 2. 

L IN ADDITION TO LIMITING THE ACCELERATION MAGNITUDE, LIMIT THE 
c AMOUNT OF REAL TIME IN THE ENTIRE TRAJECTORY TO NO LESS THAN .01 SEC. 




ELEE IF < C . LE . ( N- 1 i ) THEN 

C ( 1 > -- >: ( 5 ) *-*2 + X \ 6 ) + *2 - 1 » D+OO 
C ( 2 • - •• i ( 7 > + . 0 1 D + OO 

IF IFLHG . GE - 1 > THEN 
DO 120 II = 1,2 

DC 1 00 JJ - 1 ,7 

LCD u I I , 1 T 1 = O.D+OO 
13 0 CONTINUE 

DO 11 j J.J ~ 1,2 

DCDU ( I I , J J - “ 0 . D + OO 

1 i o CONTINUE 

120 CONTINUE 

DCD * \ i , 0 ) = 2 . D+00*X ( 5 ) 

DCD X <!,£>.* - 2 . D+OO* X ( 6 > 

DCDx < 2 , 7 1 ~ - 1 . D+OO 
IF i I FLAG . EO . 2 ; THEN 
DO ECO II = 1,2 

DO 240 JJ = 1,7 

DO E30 MM - 1,2 

D 2 CDX 2 < II , J J , MM ) = O.D+OO 
£ 3 >;« CONTINUE 

DO 235 MM = 1,2 

D 2 CD XU (II, J J , MM ) = 0 . D+» >0 
235 CONTINUE 

2 \ r- CONTINUE 

DU 2^5 JJ - 1 , 2 
DO £43 MM = 1,2 

DEC DUE (IT, J J , MM - — J . D+OO 
342 CONTINUE 

F^" CONTINUE 
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n n n 


250 CONTINUE 

D P 0 D X £ < 1,5-5) = £ - D + 0 O 
D2CD XE < l,o- 6 - = £ . 0+00 

PUD ip- 

FUi;» if 
C 

r DO STAGE f . ASSUME NL = 5, N* ~ 7. AND UU = v. 

NOTE THAT THIS IS THE ONLY STAGE WITH EQUALITY CD! I5TF- A I ’PS . 

THE riEST CONSTRAINTS ARE EQUALITY CONSTRAINTS. 

ELSE IF U-.EQ.U> THEN 

c < i ■ = x ■; i 

C<E> =■ * < 2 > 

C ( 1 -- X < -V j 

L ! 5 ) v- y f 5 j **£ + x ‘ o ) -- 1 . D+00 

TP IFLAG.GE.1U THEN 

ISO 320 II = 1,5 

DO 300 ,JT = 1,7 

OCD X «■ 1 1 , JT > = 0 . D + 00 
CONTINUE 
CONTINUE 

DCDX >: 1 • 1 1 - 1.0+00 

LCD X > 3,3- = 3 -D+00 
DCDX ( 4 * h i - 1 . D+VU 
D CD X i 5 , 5 > - 2 . D+00 * X < 5 ) 

DCD X (5,6) = £ . D+00* X < 6 ) 

IF < J.FLAG • EQ • 2 ) THEN 
DO 350 II = 1 ,5 

DO 340 JJ = 1,7 
DO 330 MM = 1,7 

D2CDX2 < 1 1 , J J , MM 11 = 0 . D+00 
CONTINUE 
CO NT I NUE 
CONTINUE 

D2CDX2 \ 5 ,5,5 » = £ . D+00 
D2CD xat5,6,oi - 2. - D+00 
END IF 
END IF 
END IF 
RETURN 
END 

C THIS =.«JB R ni IT INF INITIALIZES VECTORS NEEDED FOR PARIS- MAIN 0 C 'T T M I ■ IvM 

c ALGORITHM". NOTE that THIS IS FOR a 32 -STAGE PROBLEM. STAGES 0 THROUGH 21. 
C IN OTHER WORDS N = 31 IS ASSUMED. 

C ONE CAN EASILY SWITCH THESE THREE SUBROUTINES TO USE A DIFFERENT NUMBER 
C OF PARAMETERS BY SIMPLY CHANGING N IN THE THREE PARAMETER STATEMENTS 
r THAT OCCUR NEAR: THE BEGINNING OF EACH OF THE THREE SUBROUTINES . 

C 

C THIS SUBROUTINE SHOULD GET CALLED BY THE MAIN PROGRAM BEFORE I 
C CALLS 7 Hi p OPTIMIZATION ALGORITHM. THE MAIN PROGRAM MUST BE 
C SURE * TO MATCH DIMENSIONS WITH THE F I X ED- DIMENSIONING INFORMATION 
C OF THIS SUBROUTINE’S ARRAY ARGUMENTS. ALL OF THE ARGUMENTS OF 
C THIS SUBROUTINE ARE OUTPUT ARGUMENTS. 

C NOTE THAT THE MAIN PROGRAM WILL HAVE TO INCLUDE THE STATEMENT 
C E X TERNAL FLMNTM , CMNTM 

C IN ORDER TO REFER. TO THE ABOVE TWO PROBLEM MODELLING FUNCTIONS. 

0 THESE TWO FUNCTIONS REPLACE THE DUMMY FUNCTIONS FLFNCT AND CFNCT 
r AS T* NR UTS FROM THE MAIN PROGRAM TO THE MAIN OPTIMIZATION SUBROUTINE. 


300 

320 


330 

340 

350 
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5U3RDUT 1 NS NNTM I N I NUVEC - NX "EC . NECVEC , N I C VEC - NUN,-, * - N X MA , NOMA* • 
UO , XO , I DT , t . KRF , I 3ECGR , DELTU , DELT X > 

: MPl. I C L T F EAL*8 - A-H . O-Z < 

I !’•'[ F' l I C T T IN T E b E R * ^ < : - N * 

PARAMETER N=3* ; 

D I MENS lOt! NUVEC ■: 0:N> , NX VEC \ 0 : N » , NECVEC \ 0 : N > , NI CVEC < 0 : M > , 

Jo \ j. r 3 . 0 : N > * XO i 1 : 7 « 0 : N i * IDT \ 0 : N ; T \ n : N > - 

!•: RE-' • 1 : E - 0 : < M- i . = , I SECGR < ’ ■ : N «. DELTU - j : 3 , V : N > ■ 


' I *-tL I EE D I MENS I ON ING INFORMATION AND OTHER PR' 
L GNTROL 1 1 'FORMAT I ON . 


DLEM MO I 


NOVEL \ - 
NX' 'EC 1 ’ 
NECVEC' 


I dt . o i - i 
r-o- = v . d 
DEL 'I =■ i.D- 
TDUM - O.D 
fry i , o 


■ DBl-E •: FLOAT •. r-4— i 


NUVEC ( I > = 8 
NXVEC<I> --- 7 
NECVEC' I; = A 
N I CVEC (I; - 8 
I DT ■' H - O 
T < I > -- TDUM 
TDUM = TDUM ■+ DELT 
K R 1.41 n 1 ^ 1 

MRK < 2 * I i = 1 
I SECGR < I j - 0 
CONTI NUE 
NUVEC \N> = O 
NXVEC(N) = 7 
NECVEC -■ N 4 = h 
NICVEC<N) = 1 
IDT (Ni = 1 
T<N i -- TDUM 
I SECGR .ON .■ 

MU MAX - 3 
NX MAX - 7 
tiCMAX = 5 


T. , THERE IS REALLY NO NEED TO SRECIFY THE ( 
JEC'O",, ~ ■>, THE SUBROUTINE FLMNTM ESSENTIA 

:tial condition. 


■ ' S INITIAL CONDITIONS 
SPECIFIES THE STATE'S 


INITIALIZE GUESS OR THE SOLUTION. THIS IS A POOR 
FIRST GUESS , IT ONLY BRINGS THE VELOCITY TO ZERO - 
IT DOES NOT BRING THE POSITION TO ZERO. 


HA r ; 4 i - -~3 . 6053B255 l 4G739D+00 
V. ;:> i a , 3 "> - -35 . i 25 LBS ] 542 I 742D+00 
o : r , i ) - R „ 94*01834 *0 2 S 1 4 7 c + o f t 
s ! 4 - I ■ ” . 43 0 9693 1 3 4 7326 D + 0 O 

U 3GUS S S -■ C 8 O R T i Y 0 < 3 * 1 f* * 2 + X ■ :> < 4 
U 1. GUESS - - X < 3 , 1 J U 3GUESS 
J2GUESS - - *0 ( 4 i i > /U3GUESS 
LLSM i - Li 1 GUESS 

X r - ! , o-l ' “ LiBGUEBc 
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non n n n 


J i j 

uo 



I t,', 


, , J i — SJJOUCi 

1,0) = Ml GUESS 

5 , 0 > - ME GUESS 


U'V'Snl ' 1 " D+OC 
DTREhL - DEL T < LiSGUESS 
TPEAL 0.D+O0 
DO 3*J Li = E , N 

IF ‘KK.uT.N* THEN 

IJiV 1 ,!:K ! = O.D+UO 

!JO\d,fiJ = O.D+OO 
END IF 

TREAL- TREAL. •* DTREAL 


"? 0 >; i , Li ) = XO ( 1 - 1 
s; 0 (E,[i' = XO(£,i 
XCM3N • i - X0(3.1 


+ TREhL* X tj i [ 
x O 5,1 ■ 

+ T R E A L ■* * ' ( • 

• J TREAL* XO •' ' 
* TREAL. * XO - 


yo< 5 ,L.i j 
xo *i u * l:l > 

V n - " , Li , 
pnv-iT ; \'HJE 


initia_i: 


DEL 


X 0 < 5 , 1 J 
X0<6« 1 


AND DEL 


FOR THE SALE 


AO 


SO 


DO 60 L'L = 0*N 

DO AO JJ = 1?7 

DELTX ( J J , K K ) - . 0 00 1 D +■ 0 0 

continue 

DO 50 JJ = 1,3 

DEL.TU *•' J J - M K ) =■ . 0 0 0 1 D + 0 O 
CONTINUE 
CONTINUE 


a d j_ ' -+ ■ ( i 1 1 T R EhL**=! ) ^ 

4- . '5 D i lOaRE AL * * 2 ) * 


CHECH IMG DER I '-'AT 1 VEw 


NORMAL POINT OF RETURN TO THE CALLING MAIN PROGRAM. 


RETURN 

END 




I 
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A multi-dimensional cubic spline formlation. 
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A Multi-Dimensional Cubic Spline Formulation 

This appendix describes a relatively simple way to generate a cubic spline interpolation of a 
single dependent variable as a function of multiple independent variables. It works when the data 
is laid out on a rectangular gird in the space of the independent variables. The grid spacing for any 
one variable need not be uniform. This is useful to trajectory optimization because cubic splines 
are continuous with continuous first and second partial derivatives. Continuity of a function and 
its first and second partial derivatives is a must to ensure convergence of most second-order 
optimization algorithms. The present spline procedure is intended to offer rapid on-line 
computation of the spline function and its partial derivatives given that some off-line calculations of 
spline coefficients have been performed. 

The mathematical problem at hand is to generate a function y(xj,x2,x3, ..., x n ) given the 
values of y at gird points. Suppose that the rectangular gird points are (xj- ,X2:„,X3 ; ,, ...,x ni ) 

w Jl J2 J3 Jn 

where ji = 1 ki, for all i = 1, .... n. Suppose, also, that xjj < xij +1 for all i = 1, ..., n and j = 

1, ..., (kj-1), and define Axjj a xjj +1 - xjj. Note that the values of Axij may vary with the index j 

for fixed i, but they will all be positive. Let the data for the function values at the grid points be 
denoted as 

yjld2J3,...dn = y( X ljl’ X 2j 2 ’ X 3j 3 ’ *"’ Xn j n ) ( E *D 

A cubic interpolation function can be developed that requires only the following data at the 
grid points: y, dy/dxj for i = 1, ..., n, 3 2 y/3xi3xj for i,j = 1,..., n, i * j, S^y/dxiSxjdxk for i,j,k = 
1,..., n, i * j * k, ..., 8 n y/3xi3x23x 3 ...3x n . In other words, the functions value is needed along 
with all of the function's partial derivatives that differentiate at most one time with respect to any 
given independent variable. For n = 1 independent variable, the required data is y and 3y/3xi at 
each grid point. For n = 2 independent variables, the required data is y, 8y/3xi, 3y/3x2, and 
3 2 y/3xi3x2 at each grid point. For n = 3 independent variables, the required data is y, 3y/3xi, 
3y/3x2, 8y/8x3, 3 2 y/3xi3x2, 3 2 y/8xi8x3, 3 2 y/8x 2 3x3, and 3 3 y/8xi8x23x3 at each grid point, etc. 

E2 
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The cubic interpolation function uses 4 special basis functions 


glCx) = Cc- l) 2 (2x + 1) (E-2a) 

g 2 (x) = x 2 (-2x + 3) (E-2b) 

g3(x) = (x-\) 2 x (E-2c) 

g4(*) = (x-l)* 2 (E-2d) 


These functions have the special properties: gi(0) = 1 and gi(l) = dgi/dbclg = dgi/dlxl j = 0; 
g 2 (l) = 1 and g 2 ( 0 ) = dg 2 /dxl 0 = dg 2 /drl 1 = 0; dg 3 /dxl 0 = 1 and g 3 ( 0 ) = g 3 (l) = dg 3 /<±cl! = 0; 
dg 4 /dxl] = 1 and g4(0) = g 4 (l) = dg 4 /dxlo = 0. These special properties permit the writing of a 
cubic interpolation directly as a weighted sum of products of these functions, and the weighting 
factors are simply the function and its partial derivatives at the node points. 

To see how the function y(xi,x 2 ,x 3 , x n ) is evaluated it is best to give the explicit formula 
for the 2-dimensional cases because a general formula applicable to all dimensions gets very 
complicated. After presenting the two-dimensional formula, the generalization to higher 
dimensions can be discussed. 

For the 2-dimensional case, suppose that point (xi,X 2 ) falls in the gird box <Ji02). That is. 
suppose that xjj. < x, < Xij. +1 for i = 1,2. Also, define Xi = (xj - xij.)/Axjj. for i = 1,2. Then the 

cubic interpolation formula is 
y(*l,X2) = yji,j 2 gl(*l)gl(*2) + yji+l,j 2 g2(*l)gl(*2> 

+ yjl,J 2 +l gl(*l)g 2 (* 2 ) + yji+l,j 2 +l g 2 (*l)g 2 ta) 

+ ^rUia Axi ii + ^rU..i2 Axi ii samite) 

+ Ax 'il g3 <**> g2 < X2 > + Ax, il 84Ul)g2&2) 

+ l£|jlj2 AX2 i2 6 1<Xl)g3(X2) + I&U 1 .J 2 AX2 i2 « 2(Xl)g3(jI2) 

+ ^|jl.J 2 -l Ax2 i 2 gl(xi)g4( * 2) + Ax 2j2 g2Ul)g4ta) 

+ 3^?|il,j2 AX1 il AX2 i2 g3(;Cl)g3(X2) 

+ J2 AX1 J1 AX2 J2 S4(X1)83(X2) 



+ slraii,*-! Ax 'ji Ax2 j2 ®<*i)**(*2) 

+ 3^ J,+l.j2*l AX1 il AX2 i2 < E -3) 

Based on the properties of the gi(jc) functions and the definition of X[ for i = 1,2, it is straight- 
forward to confirm that y(xi,x 2 ) and its partial derivatives take on the values assigned to these 
quantities at the grid points. It is also straight-forward to prove that the y(xi,x2) cubic 
interpolation function and its first partial derivatives are continuous everywhere, even at the 
boundaries between grid boxes. 

To understand how the formula would generalize to n independent variables, consider the 

individual terms in eq. (E-3). First notice that there are 16 terms. There are 4 n terms in the general 
expression for an n-dimensional spline. Each term is of the form 

C # gi 1 (xi)*gi 2 (x 2 )»gi 3 Cx 3 )*...*gi n Cxn) for ii, i 2 , 13 . in = 1,2,3,4. The constant C is a function of 

the indices ii, i2, i3, ..., in- In general, if ik = 1 or 2, then the expression for C will not include 

partial differentiation with respect to xk, but if ik = 3 or 4, then the expression for C will include 

9 

the operator (Axkj k ^-) operating on y. Additionally, if ik = 1 or 3, then C will be a value 

associated with a node with the index jk, but if ik = 2 or 4, then C will be a value associated with a 
node with the index jk+1 - These rules completely determine the function. 

The partial derivative data values at the grid points can be chosen arbitrarily and the function 
and its first derivatives will be continuous, but the second derivatives will normally be 
discontinuous at the gird box boundaries. 

With care, the grid-point partial derivative values can be chosen to yield continuity of the 
second derivatives also. A 2-dimensional example is convenient for understanding how to do this. 
Suppose that 3y/3xi is assigned arbitrarily only at the extreme grid points in the xj direction, at 
grid points (xij^j), (xi p X2 2 ), (xi 1 ,X2 3 ), ..., (xi lt X2 k2 ) and at (xi kj ,x 2 1 ), (xi kl ,x2 2 ), 
(xi kl ,x 2 3 ), ..., ( x lk 1 > x 2k 2 )- Suppose also that 3y/3x2 is assigned arbitrarily only at the extreme 
grid points in the X 2 direction, at grid points (xi 1 ,x 2 1 ), (xi 2 ,x 2 j), (xi 3 ,x 2 j), .... (xi kl ,x 2 1 ) and at 
( x lpX 2 k2 ), (xi 2 ,X 2 k2 ), (xi 3 ,x 2 k2 ), ..., (xi kl ,x 2 k2 ). Also assume that 3 2 y/3xi3x2 is arbitrarily 
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defined only at the 4 extreme comer points of the 2-dimensional grid, (x^^j), (xi kl ,x 2 1 ), 
(xi 1 ,x 2 k2 ), and (xi kl ,x 2 k2 ). This situation is illustrated in Fig. E-l. 

These boundary partial derivatives can be used to define the interior point partial derivatives 
through a sequence of 1 -dimensional splines. Suppose that z(w) is a function of just one 
independent variable, w. Suppose also that z is specified on a 1 -dimensional w grid that consists 
of k points: z\ = z(wi), Z 2 = z(w 2 ), Z 3 = z(w 3 ), ..., z^ = z(wj c ). Then, knowledge of 3z/9wli and 
3 z/ 3 wlk is sufficient to uniquely determine a cubic spline that passes through all of the data points 
and whose second derivatives are everywhere continuous. This is a standard result of 1- 
dimensional spline theory [ 6 ]. This standard cubic spline has well-defined first derivatives at the 
interior grid points, 3 z/ 3 wl 2 , 3 z/ 9 wl 3 , 9 z/ 3 wl 4 , ..., 9z/9wlk-i. These are exactly the remaining data 
that would be needed to implement the 1 -dimensional counterpart to eq. (E-3). 

For the 2-dimensional case this process is generalized to determine the required interior- 
point partial derivatives. First, the values of 3y/3xi at interior grid points can be determined by 

applying the 1 -dimensional technique for computing interior derivatives to each set of grid points 
with a fixed value of X 2 - This is done k 2 times, once for each for X 2 e {x 2 j, X 2 2 , X 2 3 , ..., X 2 k2 ). 

In a similar fashion, the values of 3 y/ 9 x 2 at interior grid points can be determined by applying the 
1-dimensional technique ki times, once for each xi e {xi lt xj 2 , xi 3 , ..., xi kl ). At the end of 

these two sets of 1 -dimensional spline operations, 3y/9xi and dy/dx.2 are available at every grid 
point, but 3 2 y/ 9 xi 3 x 2 is still available only at the 4 comer grid points. 

The first step in determining 3 2 y/dx] 3 x 2 at the interior points is to determine it at the xij and 
xi kj extremities of the grid. This can be done by using z = 9y/3xi and w = X 2 in a 1-dimensional 
spline operation. For the two cases, xi = x 1 1 and xj = xi kj , z is available at all of the w grid 
points (w = X 2 !, X 2 2 , X 2 3 , .... X 2 k2 ) and 9z/9w is available at the w endpoints (w = \2\ and X 2 k2 ). 

Therefore, the spline operation can be used to determine 9z/3w at the interior w points (w = X 2 2 , 
X 2 3 , ..., X 2 k2 _ 1 , but 9z/9w = 9 2 y/ 3 xi 3 x 2 according to the current definition of z; so, at the end of 

this operation 3 2 y/ 3 xi 9 x 2 will be available at the points marked on Fig. E-2. 



The final step in determining d^y/dx\dx 2 at the interior points is to perform a series of 1- 
dimensional splines of z = dy/dx 2 in the w = xj direction, one for each grid value of X 2 . This can 
be done because in each case, z is available at all of the interior points and 9z/3w is available at the 
w end points. The resulting values of 9z/9w at the interior points are just the interior values of 
9 2 y/9xi9x2 that are needed in eq. (E-3) to compute y(xi,x 2 ). Note that the roles of xi and X 2 can 
be reversed in the process of determining the interior-point values of 9 2 y/9xi9x2 without affecting 
the final results. 

All of the 1 -dimensional spline operations that are needed to get the interior partial derivatives 
can be computed off-line, and the resulting grid-point partial derivatives can be stored for rapid on- 
line evaluation of y(xi,x 2 ) via eq (E-3). The total number of off-line 1-dimensional splines 
required to compute all of the interior-point partial derivatives is 2 + ki + 2 k 2 , which is relatively 
inexpensive. 

The boundary-point partial derivatives defined on Fig. E-l are still arbitrarily selectable. In 
practice, the function y(xi,x 2 ) is insensitive to the selected boundary partial derivatives at points far 
in the interior of the grid. In grid boxes near the boundary the function becomes sensitive to these 
arbitrary quantities. A useful approach to selecting these quantities is to compute finite difference 
approximations of these quantities near the boundary and extrapolate the required quantities to the 
boundary. The required finite-difference approximations can be computed from the y data values 
at the grid points. Figure E-3 illustrates the interior points at which various finite-difference partial 
derivative approximations get computed in order to extrapolate a particular partial derivative to a 
particular boundary point. This method of determining the arbitrary boundary partial derivatives 
has been used in all the calculations of this report. 

There is a straight-forward, but tedious generalization of this method to the problem of 
computing interior-point partial derivatives for higher-dimensional splines. In the interests of 
brevity, higher-dimensional generalizations have been omitted. 
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Fig. E-l. Boundary points on a 2-dimensional spline's grid at which 
partial derivatives may be assigned arbitrarily. 
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Fig. E-2. Propagation of the second cross partial derivatives along two edges 

of the grid from the four corners. 
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% Example boundary point where dy/dx^ is calculated by linear extrapolation from two 
interior points 

© Corresponding interior points where finite-difference values of 9y/9x l are available 


I Example boundary point where 3y/9x 2 is calculated by linear extrapolation from two 
interior points 

^ Corresponding interior points where finite-difference values of 3y/9x 2 are available 

▼ Example boundary point where d y/dx 1 dx 2 is calculated by bi-linear extrapolation from 
four interior points 

2 

V Corresponding interior points where finite-difference values of d y/3x ] 9x 2 are available 


Fig. E-3. Examples of how partial derivatives on the boundary are 
extrapolated from finite-difference interior values. 
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